Preface


Since its first meeting twenty-six years ago, the IDEAS conference has always been, except for the
past two years, a place that brings together people who share an interest in data
engineering, applications, and science, as well as a passion for sharing that science with
like-minded fellow scientists. Fortunately, the Covid pandemic has subsided to the level that
Hungary was classified as a safe travel destination by the US Centers for Disease Control and 
Prevention, and it was possible to meet again in person for most participants, while others 
participated remotely. The conference also has a tradition of being held in various countries 
and continents to increase its visibility. This time the conference was held at the ELTE 
University, in Budapest, Hungary, a site that was chosen by the conference steering committee 
over a year ago. Since that choice was made, some unforeseen difficulties emerged and posed 
new challenges, including the tragic war in neighboring Ukraine. The uncertainties led us to 
schedule the conference for later than usual in the summer, for August 22-24. 


The conference covered many topics including big data, blockchain, data analytics, machine
learning, OLAP, and watermarking. The invited presentations by Prof. Andras Benczur, former 
Dean of Sciences, ELTE University, and Prof. Schahram Dustdar, Director of Distributed Systems 
Research at TU Wien, were some of the highlights of the conference. 


We would like to take this opportunity to thank the members of our program committee, listed 
here, for their help in the review process. The conference received 38 regular papers and one 
invited paper that was also reviewed by the program committee. All the submitted papers were 
assigned to four reviewers, and all program committee members’ papers received double-blind
reviews. The proceedings consist of 1 invited paper, 16 full papers (acceptance rate 42%), and 6 
short papers (16%). 


This conference would not have been possible without the help and effort of many people and 
organizations. We would like to express our appreciation to the following people: 

-ACM (Anna Lacson, Craig Rodkin, and Barbara Ryan), 

-BytePress, ConfSys.org, Concordia University (Will Knight and Gerry Laval), 

-ELTE University and the local organization of the conference (Prof. Attila Kiss and Agnes Kerek), 


-Many other people and support staff, who contributed selflessly and have been involved in 
organizing and holding this event. 


We greatly appreciate their efforts and dedication to the conference. 


Peter Z. Revesz, Program Committee Chair, IDEAS 2022 
Professor, School of Computing, University of Nebraska-Lincoln, USA 


Lincoln, Nebraska, August 2022


Distinguishing Fake and Real News of Twitter Data with the help of Machine Learning Techniques


Aanan Shah 


School of Engineering and Computer Science, Laurentian University, Sudbury, Ontario, Canada
ashah2@laurentian.ca 


ABSTRACT 


News articles influence people’s beliefs and views about various
circumstances. Some news publishers with a political or ideological
bias therefore try to spread news that is distorted or entirely false.
In this work, natural language processing was used to preprocess the
text. General features such as the number of words, sentences,
stopwords, non-alphabetic words, verbs, nouns, and adjectives were
identified. Words were tagged by part of speech to distinguish nouns,
pronouns, adjectives, and verbs in the sentences. Preprocessing was
followed by three feature extraction methods: count vectorizer, Term
Frequency-Inverse Document Frequency (TF-IDF) vectorizer, and word2vec
embedding. The TF-IDF features gave better results than the other two
methods. Several machine learning models were trained: Naive Bayes,
Logistic Regression, Random Forest, K-nearest neighbors (KNN), Support
Vector Machine (SVM), and, as a deep learning model, a Recurrent Neural
Network (RNN). The models were tested on two datasets. On the first
dataset, SVM achieved an accuracy of 98.5% and RNN achieved 98.03%, a
large improvement over the best result of Agarwalla et al., 2019
(83.16% accuracy). On the second dataset, SVM achieved an accuracy of
97.76%, RNN 97.1%, and Logistic Regression 97.50%, an improvement over
the best result of Vijayaraghavan et al., 2020 (94.88% accuracy).


CCS CONCEPTS 


• Computing methodologies → Artificial intelligence; Search methodologies; Continuous space search.


KEYWORDS 


Natural Language Processing (NLP), Feature extraction, count vectorizer,
TF-IDF vectorizer, Word2vec embedding, support vector machine (SVM),
Random forest, Naive Bayes, K nearest neighbors (KNN), logistic
regression, recurrent neural network (RNN) - LSTM


1 INTRODUCTION


Social media platforms such as Facebook, Twitter, and WhatsApp are used
to interact with other people, and millions of news items are uploaded
and exchanged on them daily. These information exchange media are
somewhat unpredictable and unreliable at spreading news to the
community. Because millions of posts are available on social media
sites, it is often highly challenging for users to tell whether the
news in a post is real or fake. Fake news not only affects business
growth, it also targets celebrities, politicians, and other famous
personalities. It aims to spread rumors about its targets in the
community and to manipulate the facts. The situation gets worse when
people blindly share the news further, which not only exaggerates the
situation but also results in stressful and unwelcome outcomes.

Along with advancements in the media sector and social networking
platforms, the emergence of Artificial Intelligence (AI) in computing
and research has revolutionized the world through its many applications
in various fields of life [1]. The sub-fields of AI, such as machine
learning and deep learning, provide learning algorithms to analyze text
and big data. These algorithms are applied in a wide range of fields,
including medicine, modeling and simulation, social network analysis,
language processing, graph, audio, or video analytics, robotics, and
visualization [2]. AI’s primary advantage is that it does not require
complex modeling and design and instead relies on simple equations, but
the data used for training and testing the predictive models must be of
high quality to obtain good results. Fake news detection has also
become more interesting with the involvement of AI. The problem of
identifying news as real or fake belongs to Natural Language Processing
(NLP) [3] within the AI domain. NLP algorithms are used to train the
computer to read and decode human language and extract valuable
information from it. These algorithms are machine learning and deep
learning-based solutions to NLP applications.

Existing research [4] on fake news detection reports results on the
Twitter dataset, but the accuracy is still limited. This work on fake
news detection was undertaken to improve the results on the same
dataset and on other datasets by implementing various machine learning
techniques. When the accuracy is high, people can more readily trust
the authenticity of news, which protects the community from fake news.
This study has been influenced by the work done in [5] and applies some
additional techniques to develop a fake news detection mechanism using
deep learning and NLP. Fake news is a report that is deliberately false
and misleads readers. This narrow definition is essential because it
removes the ambiguity between fake news and related concepts, e.g.,
fabrications and satire [6].
Agarwalla et al. [4] examined how mass media influences the general
public’s life. The dataset in that research was taken from kaggle.com.
It has 4008 rows and 4 columns, named "URLs", "Function", "Body" and
"Name". The dataset contains 2136 false news stories and 1872 genuine
news stories. A Naive Bayes classifier with Lidstone smoothing gave the
maximum accuracy of 83.1% on the specified training set, whereas
earlier models with Naive Bayes (without Lidstone smoothing) achieved
74% accuracy [4]. Logistic Regression was also used in those models,
with the learning rate (α) as the critical parameter. Learning rates
between 5 and 12 converged to the same point, and a value of
approximately 10 was used. This model produced a very low accuracy of
65.8%. SVM achieved an accuracy of 81.6%.

Looijenga [7] investigated how fake messages were used on Twitter
during the 2012 Dutch parliamentary election campaign, examining the
performance of 8 supervised machine learning classifiers on a Twitter
dataset. The author reports that the Decision Tree performed best on
this dataset, with an F1-score of 88%. Out of 613,033 tweets, 328,897
were identified as real and 284,136 as fake. A further 150 tweets were
compiled from the corpus; these messages were sent by bots or were
otherwise fake, and they were labeled as bogus using the
Camisani-Calzolari rule set [8]. Wynne and Wint [9] proposed a fake
news identification framework that considers the content of online news
stories. They used word n-grams and character n-grams in their
investigation. Gradient Boosting achieved the highest precision of 96%
when using character trigrams and character four-grams with TF-IDF at
10,000 features. Vijayaraghavan et al. [10] also studied fake news
detection. In preprocessing, they removed stop words, punctuation,
digits, special characters, and URL links. They then compared the
distribution of sentiment polarity of the data before and after
preprocessing. The texts were tagged to identify the part of speech of
the words. Using bar plots, they showed that pronouns were used more in
real news, whereas adverbs and adjectives were used more in fake news.
In the feature extraction step, they used word2vec embedding, a word
count vectorizer, and a TF-IDF vectorizer, considering features derived
from unigrams and bigrams in the count and TF-IDF vectorizers. Before
classification, they performed outlier removal and fine-tuning to
obtain suitable tuning parameters for each classification model. In the
classification analysis, an Artificial Neural Network (ANN) and long
short-term memory networks (LSTM), a special case of recurrent neural
networks (RNN), were used as deep learning methods. Other classifiers,
namely support vector machines (SVM), random forest (RF), and logistic
regression (LR), were also used. Using 3-fold cross-validation, they
showed that the count vectorizer features reached an accuracy of 94.88%
with the LSTM model. The highest accuracy with word2vec embedding
features was 93.06%, obtained with ANN. For TF-IDF features, the
maximum accuracy was 94.79%, obtained with logistic regression.

In this study, a machine learning model for fake news detection was
developed using a dataset from kaggle.com [11], the same dataset that
was used by Agarwalla et al. (2019) [4]. The dataset includes labels
for fake and real news. The two categories have close proportions, with
53% fake news and 47% real news, which makes the dataset balanced. The
results are compared with Agarwalla et al. (2019) [4]. The accuracy of
the SVM model (98%) was much higher than the maximum accuracy obtained
by Agarwalla et al. (2019) [4] using a Naive Bayes classifier with
Lidstone smoothing (83%). A likely reason is that they used only a
small number of features (around 100) and 1-grams, which consider each
token separately in the document. They also used a threshold of
approximately 20% for the maximum document frequency of each selected
feature. In this research, the analysis was performed without a limit
on the number of features, using (1,2)-grams together with a tuned
threshold for the maximum document frequency. Another dataset from
kaggle.com [12] was used to check whether the improvement seen on the
first dataset was due to overfitting or whether the approach is
appropriate for increasing the performance of classification models in
fake news detection. The approach was found to improve model accuracy
for fake news detection on the second dataset as well.


2 DATASET DESCRIPTION 


In this study, two different types of datasets were used to check the 
accuracy by training the models and setting the parameters. 


2.1 First dataset for fake news detection 


The dataset contains four columns: the URL, the headline, the body of
the news, and the class label, which shows whether the article is real
or fake news [11]. The dataset is appropriate for machine learning
models because the class labels are balanced between the fake and real
groups. The proportions of the two label categories are not much
different from each other, so no class is favored through a different
prior probability. The frequency of each label is shown in Figure 1.

The dataset includes 4009 news articles: 2137 (53.3%) fake and 1872
(46.7%) real. The URL and headline are complete for all 4009 articles,
but 17 real news articles and four fake news articles have an empty or
missing body. There are 12 unique hostnames in this dataset. The news
articles are taken from these news websites: "abcnews", "before its
news", "bleacher report", "clarivate", "dailybuzzlive", "activist
post", "BBC", "CNN", "disclose Tv", "NYTimes" and "Reuters". There are
also two articles from api.content-ad.net.


2.2 Second dataset for fake news detection 


This dataset was taken from kaggle.com [12] [13]. The columns in this
dataset are the title (headline), the author of the news, the text
(body), and the label, which specifies whether the article is fake news
(1) or real news (0). Some of the articles have a missing title, author
name, or body. The data includes 20,800 news articles, of which 10,413
(50.1%) are fake news and 10,387 (49.9%) are real news. The dataset is
balanced between the two categories of real and fake news. The
frequency plot for author name, title, and body for fake and real news
is shown in Figure 2.

Figure 1: Frequency of fake and real news in the dataset.

Figure 2: Frequency of fake and real news in the dataset

There are 1957 missing author names (1931 in fake news and 26 in real
news). No title or text is missing for real news, whereas fake news has
558 missing titles and 39 missing texts. These 558 + 39 = 597 articles
were removed from the dataset. After removal, 20,203 articles remain,
of which 10,387 (51.4%) are real news and 9,816 (48.6%) are fake news.
The dataset is still balanced and can be used for classification
analysis.
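The counts quoted above can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in this section; the variable and function names are illustrative.

```python
# Counts quoted in Section 2.2; names are illustrative only.
total_articles = 20_800
missing_title_fake = 558
missing_text_fake = 39

removed = missing_title_fake + missing_text_fake
remaining = total_articles - removed

# Class counts reported after removing the incomplete articles.
real_after = 10_387
fake_after = 9_816


def share(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


print("removed:", removed, "remaining:", remaining)
print("real %:", share(real_after, remaining), "fake %:", share(fake_after, remaining))
```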


3 DATA PREPROCESSING 


Preprocessing is the most crucial step in machine learning. Real-world
data is frequently incomplete, inconsistent, or missing, and is likely
to contain errors. Data preprocessing is a way of addressing these
problems. The data obtained from Twitter may be incomplete or inexact,
or may contain errors such as missing or null values. Before performing
any NLP task, the data must be preprocessed, or cleaned, to increase
its quality and make it meaningful and readable. Preprocessing also
decreases the size of the data, so it can be handled more accurately
[13]. In this research, Python and its libraries were used to perform
preprocessing on the data. Preprocessing removes all digits,
punctuation, stopwords, and URLs from the news article. It includes
stopword removal, punctuation and digit removal, URL removal,
separation of sentences, word tokenizing, identifying word positions in
a sentence, converting words to lower case, word lemmatization, word
tagging, and concatenation.
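A minimal sketch of part of this cleaning pipeline is given below, using only the Python standard library. It is not the code used in the study: the stopword list is a tiny illustrative subset, the regular expressions are assumptions, and lemmatization and word tagging are omitted because they require an NLP library.

```python
import re
import string

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "to",
             "in", "and", "on", "at", "it", "this", "that"}

URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]


def clean_sentence(sentence):
    """Lowercase, drop digits and punctuation, tokenize, remove stopwords."""
    sentence = re.sub(r"[0-9]", " ", sentence.lower())
    sentence = sentence.translate(PUNCT_TABLE)
    return [tok for tok in sentence.split() if tok not in STOPWORDS]


def preprocess(text):
    """Remove URLs, clean each sentence, and concatenate the tokens."""
    text = URL_RE.sub(" ", text)
    tokens = []
    for sentence in split_sentences(text):
        tokens.extend(clean_sentence(sentence))
    return " ".join(tokens)


example = "Breaking: The president visited 3 cities! Read more at https://example.com/news"
print(preprocess(example))
```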


4 FEATURE EXTRACTION AND CLASSIFICATION


After cleaning, the data should be mapped into a numeric representation
in the form of vectors. This is part of the dimensionality reduction
process, since a large dataset contains many variables. Feature
extraction helps to obtain the best features from big datasets and so
increase the accuracy of the model. Using feature extraction, words can
be counted and the importance of the words in the dataset can be
determined, which helps to remove redundant data from the dataset.
Feature extraction reduces the number of features in a dataset by
generating new features from the current ones and then discarding the
original features. This new, reduced set of features should still
represent much of the information found in the original set. In this
research, three feature extraction techniques were used: count
vectorizer [16], TF-IDF vectorizer [17], and word2vec embedding [18].
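To make the difference between count and TF-IDF features concrete, the following pure-Python sketch computes both for a toy corpus. It uses one common smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization; the exact weighting used in the study is not stated, so this choice and all function names are assumptions.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Return the sorted vocabulary and one term-count Counter per document."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, counts


def tfidf_vectors(docs):
    """Return the vocabulary and L2-normalized TF-IDF rows (lists of floats)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


docs = ["fake news spreads fast", "real news spreads slowly"]
vocab, rows = tfidf_vectors(docs)
print(vocab)
print([round(v, 3) for v in rows[0]])
```

In this toy corpus, "news" occurs in both documents and therefore receives a lower weight than "fake", which occurs in only one.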

After feature extraction, the following classification methods 
were applied to the datasets for the detection of fake news: Support 
Vector Machine (SVM) [19], Logistic Regression (LR), Naive Bayes 
(NB) [20], Decision Trees [21], Random Forest, K-nearest Neighbors 
(KNN) [22], and Recurrent Neural Networks (RNN) [23]. 


5 RESULTS AND DISCUSSION 
5.1 Classification results for the first dataset 


The classification was done on the dataset using the extracted
features. For each model, TF-IDF features were used for training,
computed from the headline, the body, and the combination of headline
and body. To evaluate the accuracy of the classification models, 5-fold
cross validation was used, splitting the data randomly 5 times into 70%
training data and 30% testing data. The classification accuracy of the
proposed approach is compared with that of Agarwalla et al. (2019) in
Table 1. The accuracies are much higher than those obtained by
Agarwalla et al. (2019). In the next step, grid search was used to find
the best tuning parameters for each model, and the results could be
improved considerably by using more features. The number of features,
the threshold for document frequency, and the tuning parameters of each
classification model were found using a pipeline and grid search over
various parameter ranges. 5-fold cross validation was used to find the
optimal model. The accuracy of the models is presented in Table 2.
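The evaluation scheme described above, in which the data is split randomly 5 times into 70% training and 30% testing portions, can be sketched as follows. As described, this corresponds to repeated random hold-out splits; a partition into 5 disjoint folds would instead give 80%/20% splits. The function names, the fixed seed, and the `evaluate` callback are illustrative assumptions, not the code used in the study.

```python
import random


def repeated_holdout_splits(n_samples, n_repeats=5, test_fraction=0.3, seed=0):
    """Yield (train_indices, test_indices) for repeated random hold-out splits."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    n_test = round(n_samples * test_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def mean_score(evaluate, n_samples, **split_options):
    """Average a user-supplied evaluate(train, test) score over all splits."""
    scores = [evaluate(train, test)
              for train, test in repeated_holdout_splits(n_samples, **split_options)]
    return sum(scores) / len(scores)


splits = list(repeated_holdout_splits(1000))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In practice, `evaluate` would train a classifier on the training indices and return its accuracy on the test indices.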

The accuracy with 5-fold cross validation on the combination of
headline and body for the support vector machine is 0.98. This is about
15 percentage points higher than the accuracy found by Agarwalla et al.
(2019). This shows that using more features and considering bigrams
instead of unigrams in the feature extraction leads to a large
improvement in the classification model.

Table 1: Classification results (accuracy, %) compared with Agarwalla et al. (2019) [4]

                  Naive Bayes with           Support vector
                  Lidstone smoothing         machine (SVM)              Logistic Regression
Feature set       Current study    [4]       Current study    [4]       Current study    [4]
Headline + body   95.99            83.16     98.50            81.65     98.25            65.88
Body              96.16            82.53     98.41            81.65     98.08            65.88
Headline          89.22            68.05     89.31            66.24     89.22            66.57

The 5-fold cross validation accuracy for Naive Bayes using the optimal
parameters was found to be 0.963, much higher than the accuracy found
by Agarwalla et al. (2019). The optimal maximum document frequency was
almost the same as the one they used; the difference here is that no
limit was set on the number of features and bigrams were used instead
of unigrams.

The maximum document frequency for Logistic Regression was found to be
0.75. No threshold for maximum features was considered. Bigrams were
found to be superior to unigrams. The accuracy of Logistic Regression
was found to be 0.986, much higher than the 0.6588 reported by
Agarwalla et al. (2019).

For Random Forest, the grid search results show that unigram features
are superior to bigrams, in contrast to all other classifiers, which
worked better with bigram features. The optimal maximum document
frequency was found to be 0.5. No maximum depth was set for the
decision trees. The Random Forest model used 100 estimator trees, a
minimum of 2 samples for splitting each node, and a maximum number of
features per split equal to the square root of the number of features
in the model. Gini impurity was used as the splitting criterion for
fitting the Random Forest model. Gini impurity measures the likelihood
that a randomly chosen observation would be incorrectly classified if
it were labeled at random according to the class distribution in the
node. The accuracy of the Random Forest model was found to be 0.969
using the optimal tuning parameters.
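For reference, the Gini impurity of a node with class proportions p_1, ..., p_K is 1 - (p_1^2 + ... + p_K^2). A minimal sketch is given below; the function name is illustrative, and the last example simply applies the formula to the class counts of the first dataset (2137 fake, 1872 real).

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure_node = gini_impurity(["fake"] * 4)          # a single class: impurity 0
even_node = gini_impurity(["fake", "real"])      # 50/50 split: impurity 0.5
root_node = gini_impurity(["fake"] * 2137 + ["real"] * 1872)
print(pure_node, even_node, round(root_node, 4))
```

A split is preferred when it lowers the weighted impurity of the child nodes relative to the parent node.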

For k-nearest neighbors, the grid search found k = 20, which considers
the 20 nearest neighbors of each observation, to be optimal among the 5
values tested (5, 10, 15, 20, 25). A maximum document frequency of 0.5
was selected as the threshold. Bigrams were found to be superior to
unigrams. The k-nearest neighbors classifier shows an accuracy of 0.939
with 5-fold cross validation using the optimal parameters.

The RNN-LSTM model was trained with the Adam optimizer, using
categorical cross entropy as the loss function. The model was run for
10 epochs with a batch size of 64. The accuracy of the LSTM model for
the combination of headline + body is 0.98, with a loss of 0.137.


5.1.1 Summary of results. The optimal values found using grid search
were entered for each model. The data was split randomly 70:30, keeping
70% of the data for training and setting 30% aside for testing. All 5
classification models were trained, and the models were tested using
the test data. The performance of the models on the test data
(precision, recall, F1 score and total accuracy) and the confusion
matrices are presented in Table 2. Among the classifiers, the support
vector machine (SVM), which predicted 630 out of 643 fake news articles
and 549 out of 554 real news articles correctly, has the highest
accuracy, precision, and recall. Only 13 fake news articles were
wrongly classified as real news, and 5 real news articles were wrongly
predicted as fake news. The total accuracy is 98.5%. The ROC curve for
each classifier is presented in Figure 3, and the AUC values are listed
in Table 2. The highest AUC values are those of the SVM model (0.9988)
and LSTM (0.9987), followed by Logistic Regression (0.9978), Random
Forest (0.9977), Naive Bayes (0.9941), and K-nearest neighbors
(0.9815). The AUC of all 6 classifiers is very close to 1.0, which
shows that all classifiers performed very well. The ROC curve for LSTM
is drawn separately in Figure 3 (b); its AUC is almost equal to that of
SVM.
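The reported SVM figures follow directly from its confusion matrix. The sketch below recomputes precision, recall, F1 score, and accuracy for the fake class from the four counts quoted above (630, 13, 5, 549); the function name and the choice of "fake" as the positive class are illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# SVM on the first dataset, with "fake" as the positive class:
# 630 fake predicted fake, 13 fake predicted real,
# 5 real predicted fake, 549 real predicted real.
svm = binary_metrics(tp=630, fn=13, fp=5, tn=549)
for name, value in svm.items():
    print(name, round(value, 4))
```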


5.2 Classification Results for the Second Dataset 


Classification analysis was done using this larger dataset, which
includes 20,203 rows without missing values. Vijayaraghavan et al.
(2020) used 3-fold cross validation on this dataset. For comparison,
3-fold cross validation was performed using the count vectorizer,
TF-IDF vectorizer, and word2vec embedding feature extraction methods,
with a support vector machine (SVM) with C = 1e-05, Logistic Regression
with alpha = 0.1, and a Random Forest model with 1000 estimators. Using
these models with cross validation, the accuracy achieved (97.22%) is
somewhat higher than the accuracy found by Vijayaraghavan et al.
(2020). They reported that the count vectorizer performed best on this
dataset, while in our analysis the TF-IDF vectorizer with (1,3)-grams
performs best, with the highest accuracy among the models in Table 3.

TF-IDF was the best feature extraction method and, as in the analysis
by Vijayaraghavan et al. (2020), word2vec was the worst feature
extraction method for these datasets.

Table 2: Performance of classifiers for the Body and Headline with AUC

Classifier           Observed class    Predicted   Predicted   Precision   Recall   F1     Accuracy   AUC
                     (headline+body)   fake        real
Naive Bayes          fake n = 643      607         36          0.98        0.94     0.96
                     real n = 554      12          542         0.94        0.98     0.96   0.96       0.9941
SVM                  fake n = 643      630         13          0.99        0.98     0.99
                     real n = 554      5           549         0.98        0.99     0.98   0.98       0.9988
Logistic Regression  fake n = 643      626         17          0.99        0.97     0.98
                     real n = 554      4           550         0.97        0.99     0.98   0.98       0.9978
Random Forest        fake n = 643      599         44          1.00        0.93     0.96
                     real n = 554      2           552         0.93        1.00     0.96   0.96       0.9977
K-NN                 fake n = 643      591         52          0.95        0.92     0.93
                     real n = 554      32          522         0.91        0.94     0.93   0.93       0.9815
LSTM                 fake n = 643      621         22          1.00        0.97     0.98
                     real n = 554      1           553         0.96        1.00     0.98   0.98       0.9987

Figure 3: (a) ROC curve - classification results, test data (b) ROC curve for LSTM model.

Table 3: Results (accuracy, %) of three classification models compared with Vijayaraghavan et al. (2020) [10]

               Support vector machine     Logistic Regression        Random forest
Feature set    Current study    [10]      Current study    [10]      Current study    [10]
countvect      95.78            93.06     97.50            94.45     96.19            87.64
TF-IDF vect    97.76            94.58     62.75            94.79     96.25            87.64
word2vec       87.11            91.17     82.00            88.60     91.30            86.3

Grid search with a pipeline was also used for the second dataset to
find the optimal tuning parameters. After running the grid search with
5-fold cross validation for each grid point, the following values were
found to be optimal: a maximum document frequency of 0.95, which places
almost no limit on document frequency; maximum features set to None,
which places no limit on the number of features; an L2 penalty with
alpha equal to 1e-05; and L2 normalization. The (1,3)-gram setting,
which includes 1-gram, 2-gram, and 3-gram features in the feature
vector, was found to be optimal. A minimum document frequency of 0.007
was found to be the optimal value. The accuracy of 5-fold cross
validation for the support vector machine is 97.76%.
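The n-gram range and the document-frequency thresholds discussed above can be illustrated with a small pure-Python sketch. The function names are hypothetical and whitespace tokenization is assumed; thresholds are expressed as fractions of documents, as in the text.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams for n in [lo, hi], e.g. (1, 3) gives 1-, 2- and 3-grams."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, ngram_range=(1, 3), min_df=0.0, max_df=1.0):
    """Keep n-grams whose document frequency (as a fraction) lies in [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # set(): each n-gram is counted at most once per document
        df.update(set(ngrams(doc.split(), ngram_range)))
    return sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)


print(ngrams(["fake", "news", "spreads"], (1, 3)))

corpus = ["fake news spreads", "real news spreads", "real news matters", "real news wins"]
print(build_vocabulary(corpus, ngram_range=(1, 2), min_df=0.5, max_df=0.75))
```

In the toy corpus, "news" appears in every document and is removed by the maximum document frequency, while n-grams that appear in only one document are removed by the minimum document frequency.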

The grid search results for the Naive Bayes classifier show that the
smoothing parameter alpha = 1.0 is optimal. The maximum document
frequency was found to be 0.5. No threshold for maximum features was
considered. (3,3)-grams, i.e. trigrams, were found to be best. The
5-fold cross validation accuracy for Naive Bayes using the optimal
parameters was found to be 0.90.
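The role of the smoothing parameter alpha can be seen in a short sketch of additive (Lidstone) smoothing for the per-class term likelihoods of a multinomial Naive Bayes model: P(t | c) = (count(t, c) + alpha) / (N_c + alpha * |V|). With alpha = 1.0 this is Laplace smoothing. The function name and the toy data are illustrative, not taken from the study.

```python
from collections import Counter


def smoothed_likelihoods(class_tokens, vocabulary, alpha=1.0):
    """Return P(term | class) with additive (Lidstone) smoothing."""
    counts = Counter(class_tokens)
    total = sum(counts[t] for t in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / denominator for t in vocabulary}


vocabulary = ["fake", "hoax", "news"]
fake_class_tokens = ["fake", "fake", "news"]   # toy training tokens for one class

probs = smoothed_likelihoods(fake_class_tokens, vocabulary, alpha=1.0)
print(probs)
```

The unseen term "hoax" receives a small non-zero probability, so a single unseen term does not drive the class likelihood of a document to zero.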

The optimal values for Logistic Regression were found to be an L2
penalty with alpha = 1e-05. The maximum document frequency for Logistic
Regression was found to be 1.0. No threshold for maximum features was
considered. The (1,3)-gram setting was found to be optimal. Results
with the count vectorizer were better than with TF-IDF. A minimum
document frequency of 0.007 was found to be a good option for removing
tokens with very low frequency. The accuracy of Logistic Regression was
found to be 0.975.

Table 4: Performance of classifiers for the text with AUC

Classifier           Observed class    Predicted   Predicted   Precision   Recall   F1     Accuracy   AUC
                     (body)            fake        real
Naive Bayes          fake n = 1559     1542        17          0.80        0.99     0.88
                     real n = 1441     386         1055        0.98        0.73     0.84   0.87       0.984
SVM                  fake n = 1559     1521        38          0.98        0.98     0.98
                     real n = 1441     29          1412        0.97        0.98     0.98   0.98       0.995
Logistic Regression  fake n = 1559     1520        39          0.98        0.97     0.98
                     real n = 1441     36          1405        0.97        0.98     0.97   0.97       0.995
K-NN                 fake n = 1559     1441        118         0.82        0.92     0.87
                     real n = 1441     318         1123        0.90        0.78     0.84   0.85       0.940
LSTM                 fake n = 1559     1484        75          0.99        0.95     0.97
                     real n = 1441     12          1429        0.95        0.99     0.97   0.97       0.996

Figure 4: (a) ROC curve, classification results-testing data (b) ROC curve for LSTM model.

For Random Forest, values of None and 10 were tested for the maximum
depth of the trees. Maximum document frequencies of 0.1, 0.2, 0.5, 0.75
and 1.0 were tested, as were maximum feature counts of None, 100, 200,
500 and 1000 and n-gram ranges of (1,1), (1,2), (1,3), (2,2) and (3,3).
The optimal value of maximum features was found to be 500, and the
(1,1)-gram setting was found to be best for Random Forest. A maximum
document frequency of 1.0 was found to be optimal. No maximum depth was
set for the decision trees. The Random Forest model used 100 estimator
trees, a minimum of 2 samples for splitting each node, and a maximum
number of features per split equal to the square root of the number of
features in the model. Gini impurity was used as the criterion for
fitting the Random Forest model. The accuracy of the Random Forest
model was found to be 96.19% using the optimal tuning parameters.
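The Random Forest grid described above contains 2 x 5 x 5 x 5 = 250 parameter combinations, each of which is evaluated by cross validation. The sketch below enumerates such a grid with the standard library; the parameter names are illustrative labels, and `toy_score` is a stand-in for the cross-validated accuracy, chosen only so that the example runs without training a model.

```python
from itertools import product

# Candidate values listed in the text; the keys are illustrative names.
param_grid = {
    "max_depth": [None, 10],
    "max_df": [0.1, 0.2, 0.5, 0.75, 1.0],
    "max_features": [None, 100, 200, 500, 1000],
    "ngram_range": [(1, 1), (1, 2), (1, 3), (2, 2), (3, 3)],
}


def iter_grid(grid):
    """Yield one dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score):
    """Return the parameter combination with the highest score."""
    return max(iter_grid(grid), key=score)


def toy_score(params):
    """Stand-in for cross-validated accuracy (hypothetical, not real results)."""
    return ((params["max_features"] == 500) + (params["ngram_range"] == (1, 1))
            + (params["max_df"] == 1.0) + (params["max_depth"] is None))


candidates = list(iter_grid(param_grid))
best = grid_search(param_grid, toy_score)
print(len(candidates), best)
```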

For k-nearest neighbors, k values of 5, 10, 15, 20 and 25 were used in
the grid search. The options for document frequencies and n-grams were
the same as for Naive Bayes. A maximum document frequency of 0.75 was
selected as the threshold. The (1,3)-gram setting was found to be
superior to unigrams and bigrams. The value k = 15 was found to be
optimal. The k-nearest neighbors classifier shows an accuracy of 84.9%
with 5-fold cross validation using the optimal parameters.

The LSTM model was trained with the Adam optimizer, using categorical
cross entropy as the loss function. The batch size was set to 64
articles, and the model was trained for 10 epochs. Features were
generated from sequences of words, with the maximum number of sequences
set to 1100. The LSTM model achieved an accuracy of 97.1%, which is
somewhat higher than the best accuracy found by Vijayaraghavan et al.
(2020) using three LSTM models.
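One plausible reading of the value 1100 is that each article is converted to a sequence of word indices that is truncated or padded to a fixed length of 1100 before it is fed to the LSTM. The sketch below illustrates that reading only; the reserved ids (0 for padding, 1 for unknown words) and the function names are assumptions.

```python
def build_index(docs):
    """Map each distinct token to an integer id, reserving 0 (pad) and 1 (unknown)."""
    index = {}
    for doc in docs:
        for token in doc.split():
            index.setdefault(token, len(index) + 2)
    return index


def encode_and_pad(tokens, index, max_len=1100, pad_id=0, unk_id=1):
    """Convert tokens to ids, truncate to max_len, and right-pad with pad_id."""
    ids = [index.get(tok, unk_id) for tok in tokens][:max_len]
    return ids + [pad_id] * (max_len - len(ids))


index = build_index(["fake news spreads"])
short = encode_and_pad("fake news today".split(), index, max_len=5)
long_article = encode_and_pad(["fake"] * 2000, index)
print(short, len(long_article))
```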


5.2.1 Summary of Results. The performance on the testing data
(precision, recall, F1 score, AUC, and total accuracy) and the
confusion matrices are presented in Table 4 (Performance of
classifiers for the text with AUC). Among the classifiers, the support
vector machine (SVM) has the highest accuracy, precision, and recall,
predicting 1521 out of 1559 fake news articles and 1412 out of 1441
real news articles correctly. 38 fake news articles were wrongly
classified as real news, and 29 real news articles were wrongly
predicted as fake news. The total accuracy is 97.7%. The ROC curves of
the classifiers are presented in Figure 4(a) and that of the LSTM in
Figure 4(b); the AUC values are also listed in Table 4. The highest
AUC, 0.9960, is achieved by the LSTM. The AUC of Naive Bayes is
0.9848, of logistic regression 0.9953, of random forest 0.9877, of
SVM 0.9956, and of k-nearest neighbors 0.9408.
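
The reported SVM counts can be checked with a few lines of arithmetic. The sketch below derives accuracy, precision, recall, and F1 from the confusion matrix quoted above, assuming that "fake" is treated as the positive class (an assumption made here for illustration; the function name is hypothetical).

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# SVM counts from the text: 1521 of 1559 fake and 1412 of 1441 real articles correct,
# so 38 fake articles were missed (fn) and 29 real articles were flagged as fake (fp).
svm = binary_metrics(tp=1521, fn=38, tn=1412, fp=29)
```

With these counts the accuracy evaluates to 2933/3000, or about 0.9777, which is consistent with the reported total accuracy of 97.7%.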


6 CONCLUSIONS AND FUTURE WORK 


A methodology was proposed to detect fake news using deep learning
models and natural language processing (NLP). In this study, two
datasets were analyzed, and classifiers were trained to develop a
model that can predict whether a news item is fake or real.

The first dataset includes news articles from 12 news websites. The
data were collected such that all the news from each of these 12
websites is either fake or real. The tagged words were used for
feature extraction by the count vectorizer and the TF-IDF vectorizer.
The TF-IDF vectorizer with (1,2)-grams (one and two consecutive
tokens) yielded the best features, since it gave the highest
cross-validated accuracy. For classification, the Naive Bayes
classifier, support vector machine (SVM), logistic regression,
k-nearest neighbors (KNN), and random forest were used to classify the
news articles as fake or real. The SVM model achieved the highest
accuracy (98.5%) and area under the ROC curve (AUC, 0.9988) on the
combination of body and headline. In addition, an LSTM model with
three recurrent layers and one hidden layer was used as a deep
learning method to classify the news articles; it achieved an accuracy
of 98.03% and an AUC of 0.9984. The accuracy of the models on the
combination of body and headline is very high, especially for the SVM
model.
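
As an illustration of TF-IDF features over (1,2)-grams, the sketch below weights each unigram and bigram by its term frequency times ln(N/df). This is one common TF-IDF variant chosen here as an assumption; library vectorizers typically apply additional smoothing and normalisation, and the helper names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """All contiguous n-grams for n = 1..n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs_tokens, n_max=2):
    """Per-document weights tf * ln(N / df) over unigrams up to n_max-grams."""
    docs_grams = [Counter(ngrams(tokens, n_max)) for tokens in docs_tokens]
    n_docs = len(docs_grams)
    doc_freq = Counter()
    for grams in docs_grams:
        doc_freq.update(grams.keys())
    weights = []
    for grams in docs_grams:
        total = sum(grams.values())
        weights.append({g: (count / total) * math.log(n_docs / doc_freq[g])
                        for g, count in grams.items()})
    return weights


# Hypothetical two-document corpus.
toy_weights = tfidf([["breaking", "news", "today"], ["real", "news", "today"]])
```

In this toy corpus the shared n-grams ("news", "today", "news today") receive weight 0, whereas the n-grams unique to one document ("breaking", "breaking news") receive a positive weight.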

Some interesting information was found using general features. News
with a greater number of words in the headline and news with more stop
words in the body are more likely to be fake. In contrast, a greater
number of sentences and more verbs or adjectives in the body make the
news more likely to be real. Therefore, one can conclude that real
news tends to use shorter headlines but more explanation in the body.
More sentences, especially those which include a greater number of
verbs or adjectives, increase the probability of the news being real.
More stop words are found in fake news than in real news. The nouns in
the body of the news articles do not increase the probability of
either real or fake news. Notably, the words "photo" and "image"
appeared more often in real news, while the word "video" appeared more
often in fake news. Commonly used words such as "great" are seen more
frequently in fake news than in real news. Fake news tries to pretend
that it is real news, so the feature "law" can be seen more frequently
in fake news than in real news.
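
Count-based general features of this kind can be extracted as in the following sketch. The stop-word list is a tiny illustrative assumption, the function name is hypothetical, and part-of-speech counts (verbs, adjectives, nouns) are omitted because they would require a tagger.

```python
import re

# Tiny illustrative stop-word list; a real analysis would use a complete list.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "that"}


def general_features(headline, body):
    """Headline length, stop-word count and sentence count of one article."""
    body_words = re.findall(r"[A-Za-z']+", body.lower())
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return {
        "headline_words": len(headline.split()),
        "body_stop_words": sum(word in STOP_WORDS for word in body_words),
        "body_sentences": len(sentences),
    }


# Hypothetical article used only to exercise the function.
example = general_features(
    "Shocking claim shakes the city",
    "The mayor spoke. It is a claim that divides the city!",
)
```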

The second dataset, which also consisted of news articles and was
taken from Kaggle.com, was analyzed with the same methodology as the
first dataset. For the second dataset, it was observed that nouns,
adjectives, stop words, and non-alphabetic words were used more in
fake news than in real news, whereas real news had more verbs. In the
first dataset, fake news had longer headlines, but in the second
dataset the result was reversed: real news had significantly longer
headlines. Therefore, this parameter differs between the two datasets.
Another difference was the use of adjectives, which were significantly
more frequent in the fake news of this dataset. The classification
analysis was done using support vector machines, logistic regression,
and random forest. The analysis used tokenizers with 1-gram to 3-gram
features and tuned the parameters of the classifiers. The results
demonstrated that 3-grams gave the best accuracy of 97.7% and an AUC
of 0.9956 with SVM for this dataset, which is higher than the results
reported by Vijayaraghavan et al. (2020). Using the LSTM, an accuracy
of 97.1% and an AUC of 0.9960 were achieved, which is higher than the
accuracy of 94.88% achieved by Vijayaraghavan et al. (2020). For
logistic regression, an accuracy of 97.50% and an AUC of 0.9953 were
achieved, which is an improvement over the accuracy of 94.79% achieved
by Vijayaraghavan et al. (2020). However, their analysis found the
count vectorizer to be the best feature extraction method and word2vec
the worst performing on this dataset, whereas this study found that
word2vec with 3-grams was the best feature extraction method. It can
be concluded that certain features should be retained instead of being
removed at the preprocessing step to obtain a higher prediction
accuracy, even without using deep learning models.


6.1 Future Work 


This analysis was performed on two different datasets. Based on the
first dataset, the following ideas could be used in future studies to
enhance the quality of the models:


1- Collecting data randomly from each world news website to
reduce the bias in the dataset.

2- After the maximum document frequency threshold is applied,
some words are removed from the dataset, which may discard
information useful for distinguishing between real and fake
news. A feature counting the commonly used words in each
news article could be added to examine how it differs between
real and fake news.

3- More general features could be extracted from the data, such
as the number of hashtags, the number of mentions, the number
of each type of punctuation separately, the number of
characters, the number of digits, and unit specifiers (kg, lbs,
inches, feet and so on), to find how they differ between real
and fake news.
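
The surface features proposed in the third item could be extracted along the following lines. This is a sketch under stated assumptions: the regular expressions, the restricted unit list, and the function name are illustrative choices made here, not part of the study.

```python
import re
import string
from collections import Counter

# Assumed unit pattern: a number followed by one of a few unit names.
UNIT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|lbs|inches|feet)\b", re.IGNORECASE)


def surface_features(text):
    """Counts of hashtags, mentions, characters, digits, units and punctuation marks."""
    punctuation = Counter(ch for ch in text if ch in string.punctuation)
    return {
        "hashtags": len(re.findall(r"#\w+", text)),
        "mentions": len(re.findall(r"@\w+", text)),
        "characters": len(text),
        "digits": sum(ch.isdigit() for ch in text),
        "unit_specifiers": len(UNIT_PATTERN.findall(text)),
        "punctuation": dict(punctuation),
    }


# Hypothetical snippet used only to exercise the function.
sample = surface_features("@bob lost 5 kg! #diet #news")
```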


The second dataset appeared to be of better quality and was larger.
The analysis performed on this dataset showed some results in common
with the first dataset and some reversed ones. Hence, it seems
necessary to find which features are dataset-specific and which are
consistent across datasets. Also, the behavior of published real and
fake news could be investigated by country of release. Several
independent variables can be tested at the time of release to see
whether the changes are time-dependent or country-dependent, or
occurred merely because the feature was not significant. It was
observed that attaching the tag to the word during tokenization is a
useful idea. This idea could be investigated further to see how
effective it is and how the tag added to the token can be improved.
Depending on the tag connected to the word (noun, verb, or adjective),
the frequency of the word can be added for that specific tag. However,
it seems important to relate these tokens, which are very similar to
each other, in the feature extraction phase.
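
The tagging idea can be sketched as follows, assuming that (word, tag) pairs are already available from some part-of-speech tagger. The tag names, the "word_TAG" token format, and the function names are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def tagged_tokens(tagged_words):
    """Join each word with its tag, e.g. ('run', 'VERB') becomes 'run_VERB'."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged_words]


def frequency_by_tag(tagged_words):
    """Count how often each word occurs under each tag."""
    freq = defaultdict(Counter)
    for word, tag in tagged_words:
        freq[tag][word.lower()] += 1
    return freq


# Hypothetical tagger output: the same word used once as a noun and twice as a verb.
pairs = [("Run", "NOUN"), ("run", "VERB"), ("run", "VERB")]
tokens = tagged_tokens(pairs)
per_tag = frequency_by_tag(pairs)
```

Here "run_NOUN" and "run_VERB" become distinct tokens that share the same underlying word, which is the kind of relationship between similar tokens mentioned above.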


REFERENCES 


[1] S. Russell and P. Norvig, Artificial Intelligence: a Modern Approach, Pearson,
2002.

[2] R. Kamble and D. Shah, "Applications of Artificial Intelligence in Human Life,"
International Journal of Research Granthaalayah, pp. 6(6) 178-188, 2018.

[3] N. Indurkhya and F. J. Damerau, Handbook of Natural Language Processing,
Chapman & Hall, 2010.


IDEAS’22, August 22-24, 2022, Budapest, Hungary 


[4] K. Agarwalla, S. Nandan, V. A. Nair and D. D. Hema, "Fake News Detection using
Machine Learning and Natural Language Processing," IJRTE, pp. 7(6) 2277-3878,
2019.

[5] E. Levin, "A recurrent neural network: Limitations and training," Neural Networks,
vol. 3, no. 6, pp. 641-650, 1990.

[6] X. Zhou and R. Zafarani, "A Survey of Fake News: Fundamental Theories,
Detection Methods and Opportunities," ACM Computing Surveys, pp. (53) 5-40,
2021.

[7] M.S. Looijenga, "The Detection of Fake Messages using Machine," in 29th Twente
Student Conference on IT, Enschede, Netherlands, 2018.

[8] UKEssays, "Analysis of Obama’s Twitter and Communication Strategies in
the 2012 Presidential Election," UKEssays, November 2018. [Online]. Available:
https://www.ukessays.com/essays/media/analysis-of-obamas-twitter-and-communication-strategies-in-the-2012-presidential-election.php.
[Accessed 16 01 2021].

[9] H. E. Wynne and Z. Z. Wint, "Content Based Fake News Detection Using
N-Gram Models," in iiWAS2019: The 21st International Conference on Information
Integration and Web-based Applications & Services, Munich, Germany, 2019.

[10] S. Vijayaraghavan, Y. Wang, Z. Guo, J. Voong, W. Xu, A. Nasseri, J. Cai, L. Li, K.
Vuong and E. Wadhwa, "Fake News Detection with Different Models," ArXiv, no.
2003.04978, 2020.

[11] JRuvika, "Fake News Detection," kaggle, 2017. [Online]. Available:
https://kaggle.com/jruvika/fake-news-detection. [Accessed 16 01 2021].

[12] "Fake News," kaggle, 2018. [Online]. Available:
https://www.kaggle.com/c/fake-news. [Accessed 16 01 2021].


IDEAS2022 


Kalpdrum Passi and Aanan Shah


[13] V. Gurusamy and S. Kannan, "Preprocessing Techniques for Text Mining," in
RTRICS, Theni, India, 2014.

[14] E. Loper and S. Bird, "NLTK: The Natural Language Toolkit," ArXiv, no. 0205028,
2002.

[15] T. Korenius, J. Laurikkala, K. Jarvelin and M. Juhola, "Stemming and
lemmatization in the clustering of finnish text documents," in CIKM ’04: Proceedings
of the thirteenth ACM international conference on Information and knowledge
management, Washington, USA, 2004.

[16] J. Teo, "readthedocs," 2018. [Online]. Available:
https://readthedocs.org/projects/socialnetwork/downloads/pdf/latest/. [Accessed 16 01 2021].

[17] A. Aizawa, "An information-theoretic perspective of tf-idf measures," Information
Processing and Management, vol. 39, no. 1, pp. 45-65, 2003.

[18] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, "Distributed
Representations of Words and Phrases and their Compositionality," ArXiv, no. 1310.4546,
2013.

[19] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, no. 20,
pp. 273-295, 1995.

[20] D. Berrar, "Bayes’ Theorem and Naive Bayes Classifier," in Reference Module in
Life Sciences, 2018, pp. 1-18.

[21] P. H. Swain and H. Hauska, "The decision tree classifier: Design and potential,"
IEEE Transactions on Geoscience Electronics, vol. 15, no. 3, pp. 142-147, 1977.

[22] Z. Zhang, "Introduction to machine learning: k-nearest neighbors," ATM Annals
of Translational Medicine, vol. 4, no. 11, 2016.

[23] N. Kalchbrenner, I. Danihelka and A. Graves, "Grid Long Short-Term Memory,"
ArXiv, no. 1507.01526, 2016.


Self-Adapting Design and Maintenance of Multi-Model Databases 


Irena Holubova
Department of Software Engineering, Charles University
Prague, Czech Republic
irena.holubova@matfyz.cuni.cz

Pavel Koupil
Department of Software Engineering, Charles University

Jiaheng Lu
Department of Computer Science, University of Helsinki


ABSTRACT 


Multi-model data is organised in various mutually interlinked for- 
mats and models, often with contradictory features. In addition, 
its structure may change over time, and its size can grow to the 
extremes of Big Data. In terms of research and practical processing, 
this creates one of the most complex challenges of effective data 
management. 

As it is not humanly possible to handle such a complex task 
manually, in this vision paper, we focus on the area of automatic 
management of dynamic multi-model Big Data. We envision a 
framework capable of accepting different levels of user input and 
different types of data, queries, changes, and propagation strategies 
and ensuring the preservation of adequate and efficient data access.


1 INTRODUCTION AND MOTIVATION


Of the 388 existing database management systems (DBMSs)¹, more than
28% are classified as multi-model, i.e., supporting more than one
logical model. If we consider only the 50 most popular
representatives, involving key players such as Oracle DB, PostgreSQL,
MongoDB, Microsoft SQL Database, Redis etc., 60% of the systems have
multi-model support. This observation corresponds to the Gartner
predictions [11] of supporting multiple data models in a single DBMS,
and it reflects the requirements of current applications.


EXAMPLE 1. An example of a multi-model scenario of an e-shop is
provided in Fig. 1. The relational model (violet) contains general
information about customers, whereas the graph model (blue) captures
their mutual friendship. The document model (green) maintains orders,
which are bound to particular customers using the wide-column model
(red). The key/value model (yellow) bears information about customers’
shopping carts. As we can see, storing each record in the best fitting
model avoids impedance mismatch. A sample cross-model query over the
multi-model data instance might then, e.g., be "For each customer who
lives in Prague, find a friend who ordered the most expensive product
among all customer’s friends." [58]
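
To illustrate what such a cross-model query involves, the following sketch evaluates it over small in-memory stand-ins for the relational, graph, and document models. All sample data and function names are hypothetical; a real multi-model DBMS would express the query in its own cross-model query language.

```python
# Relational model: customer rows (hypothetical sample data).
customers = [
    {"id": 1, "name": "Mary", "city": "Prague"},
    {"id": 2, "name": "Anne", "city": "Helsinki"},
    {"id": 3, "name": "John", "city": "Prague"},
]

# Graph model: undirected friendship edges between customer ids.
friendships = [(1, 2), (1, 3)]

# Document model: orders with nested items.
orders = [
    {"_id": {"customer": 1, "number": 1}, "items": [{"id": "B1", "price": 200}]},
    {"_id": {"customer": 2, "number": 1}, "items": [{"id": "T1", "price": 900}]},
    {"_id": {"customer": 3, "number": 1}, "items": [{"id": "H1", "price": 350}]},
]


def friends_of(customer_id):
    """Neighbours of a customer in the friendship graph."""
    result = set()
    for a, b in friendships:
        if a == customer_id:
            result.add(b)
        elif b == customer_id:
            result.add(a)
    return result


def max_price(customer_id):
    """Price of the most expensive product ordered by a customer, or None."""
    prices = [item["price"]
              for order in orders if order["_id"]["customer"] == customer_id
              for item in order["items"]]
    return max(prices, default=None)


def top_friend_per_prague_customer():
    """For each customer in Prague: (friend id, price) of the friend's priciest product."""
    result = {}
    for customer in customers:
        if customer["city"] != "Prague":
            continue
        best = None
        for friend in friends_of(customer["id"]):
            price = max_price(friend)
            if price is not None and (best is None or price > best[1]):
                best = (friend, price)
        result[customer["id"]] = best
    return result
```

The query touches three models in one pass: a selection over the relational data, a traversal of the graph, and an aggregation over nested documents.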


According to the extensive survey [33], the features of multi- 
model databases vary significantly. This status is given by the fact 
that (1) they are based on the distinct original core single model 
as well as the different target application domains, (2) there is 
no acknowledged standard on how to support a combination of 
models, cross-model querying, multi-model indices etc., and (3) the 
combined models have varied, often contradictory features. There 
are structured, semi-structured, and unstructured formats; order- 
preserving and order-ignorant models; systems based on strong or 
eventual consistency; schema-less, schema-full, and schema-mixed 
storage strategies; models where data normalisation is critical or 
where redundancy is supported etc. For all these cases, there exist 
multi-model representatives [13, 33, 42]. In addition, while the 
multi-model data by definition covers the Variety feature of Big 
Data, many of the multi-model DBMSs are distributed and target 
also its high Volume and Velocity. 


EXAMPLE 2. In Fig. 2 we can see an ER model of the multi-model scenario 
from Fig. 1. It depicts the natural first step of designing a database application 
to be transformed into a selected logical model. In the case of the relational 
model, the transformation would be more or less straightforward to avoid 
redundancy and null values. Or, in the case of the document model, we would 
also need to select the roots of the hierarchies to reflect the expected queries. 
However, in the case of multi-model data, the possibilities are much wider. The 
colours denote the combination of multiple logical models corresponding to 
Fig. 1. 


The described variety of multi-model DBMSs lacking standards and the
general complexity of multi-model applications indicate the main open
problems of multi-model data management: (1) The initial choice of a
multi-model DBMS is challenging, and a later migration to another
system, should it become necessary, is complicated. (2) The complexity
of multi-model tasks requires complex and critical decisions to be
made for the optimal database design, i.e., data partitioning,
(partial) schema definition, indices, (materialised) views, queries
etc. (3) The naturally dynamic environment of multi-model Big Data
significantly increases the complexity of evolution management, i.e.,
the capability of correctly and efficiently adapting a multi-model
database application to new requirements.


[Figure 1 panels: column family Orders (customerId, orders); key/value
pairs Cart (product, quantity); JSON collection Order (_id, contact,
items)]

Figure 1: A multi-model scenario (inspired by [22])


Figure 2: An ER model of the multi-model scenario

In this vision paper, to solve these problems, we envision a self- 
adaptive framework that would enable the design and maintenance 
of a multi-model database schema under the changing requirements 
of Big Data. We identify three levels of such a system that cover 
different real-world use cases and correspond to the gradual exten- 
sion of the adaptability. In particular, we consider (1) user-specified 
changes and rule-based adaptation, (2) data-driven changes and 
learning-based adaptation, and, finally, (3) advanced self-adapting 
evolution management. We discuss the state-of-the-art as well as 
research challenges and tasks to be completed to reach the full 
robustness of the idea. 

The rest of the paper is structured as follows: In Section 2, we 
provide an overview of related work. Section 3 discusses the three 
levels of adaptation the framework should gradually reach. We
conclude in Section 4.


2 RELATED WORK


In general, there are two approaches to manipulate and query multi- 
model data: (1) polyglot persistence and (2) multi-model databases. 
The history of polyglot persistence can be traced back to the 1970s/80s 
to multi-databases [47] and federated databases [14], whose main 
strategy is to leverage different databases to store different models 
of data and then develop a mediator to integrate them together in 
order to evaluate queries. Recently, several academic prototypes of 
polystores (e.g., DBMS+ [31] or BigDAWG [10]) were also developed 
on the polyglot persistence paradigm. The challenge of handling 
the Variety of Big Data has inspired a new commercially popular 
generation of multi-model databases [33], capable of storing and 
processing structurally different data by supporting several data 
models in a single DBMS having a unified query language and API. 
Multi-model databases manage multiple models, each being treated 
as a first-class citizen, with an integrated backend, which can satisfy 
the growing requirements for scalability, high performance, and 
fault tolerance. 

Techniques focusing on the problem of database design can be 
divided into rule-based, search-based, and model-based, all of which
can be general or system-specific. As summarised in [40], the first
related solutions from the 1970s focused on index selection and 
database partitioning for relational DBMSs. Around the turn of 
the millennium, there was a boom of self-tuning DBMSs finding 
optimal database settings (i.e., indices, materialised views, parti- 
tioning schemes etc.) for a given query workload or optimal “knob” 
tuning (e.g., memory allocation). However, the approaches were 
mostly rule-based exploiting heuristic rules and various levels of 
user involvement. Unfortunately, data management heuristics can 
grow complex when trying to change them from specific to general 
cases [9]. Alternative, search-based approaches (e.g., [61]) search 
the space of possible configurations for a (sub)optimal solution. 

With the dawn of Big Data entailing the introduction of novel
DBMSs and novel approaches in artificial intelligence (AI),
such as deep reinforcement learning [59] or Bayesian optimization [8],
model-based approaches capable of learning and adapting
the choice of the solution have recently appeared. However, in
all the cases, the solutions are system- or data model-specific or 
highly limited in terms of data structures. In contrast, the area of 
multi-model Big Data in its full complexity is still untouched. 

In general, the approaches connecting database technologies and
AI can be divided into two groups [28]. DB4AI techniques can optimize
AI models and strategies, such as, e.g., the AI-native DBMS
openGauss [29], which supports a native AI computing engine,
model management, AI operators, a native AI execution plan etc.
Conversely, in AI4DB techniques, to which the envisioned framework
also belongs, AI can make DBMSs more intelligent. There
are learning-based techniques exploiting, e.g., reinforcement learning
or deep learning, for database configuration, such as index
selection [21], an index advisor [35], a partitioning advisor [17],
or general knob tuning [59]; query optimization [50] or join order
selection [36]; design, such as learned indices [7] or the key/value
design [19]; monitoring, predicting, e.g., query arrival rates [34]
or performance [60]; security, involving, e.g., data anomaly
detection [32]; etc.

Recent years have also witnessed the emergence of the autonomous 
or self-driving DBMSs (e.g., Oracle Autonomous Database [39] or 
Peloton [41]) which are expected to automatically and constantly 
configure, tune, and optimise themselves without any intervention 
from human experts. Since the optimal configuration setting is 
highly dependent on the workload characteristics, a critical step 
for an autonomous DBMS is to predict the future workload based 
on the historical data. Firstly, the DBMS should be able to forecast
when the workload will significantly change (i.e., workload shift),
how many workloads will arrive (i.e., arrival rate), and what is
the following query that a user will execute (i.e., next query) in
the future. This predicted workload information enables an autonomous
DBMS to decide when and how to re-configure itself in a
predictive manner before the workload changes occur. Secondly, an
autonomous DBMS also needs to predict the query performance by
estimating essential run-time characteristics before execution, such
as how long a query will take to complete (i.e., execution time) and
how many resources will be consumed (i.e., resource utilisation).
Predicting the execution time and resource demand before execution is
helpful in many tasks, including query scheduling, progress
monitoring, and resource management [55, 57].
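
As a deliberately naive illustration of arrival-rate forecasting and workload-shift detection, the sketch below uses a moving average as the forecast and flags a shift when the observed rate deviates too far from it. The window size, the tolerance, and the function names are assumptions made here; the cited systems rely on considerably more elaborate learned models.

```python
def moving_average_forecast(arrivals, window=3):
    """Forecast the next arrival rate as the mean of the last `window` observations.

    Assumes `arrivals` is a non-empty list of query counts per time interval.
    """
    recent = arrivals[-window:]
    return sum(recent) / len(recent)


def is_workload_shift(arrivals, observed, window=3, tolerance=0.5):
    """Flag a shift when `observed` deviates from the forecast by more than
    `tolerance` times the forecast (a relative threshold)."""
    forecast = moving_average_forecast(arrivals, window)
    return abs(observed - forecast) > tolerance * forecast


# Hypothetical history of queries per minute.
history = [100, 110, 90]
```

With this history the forecast is 100 queries per minute, so an observation of 105 is treated as normal while an observation of 200 is flagged as a shift that could trigger re-configuration.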




In the context of related work, the envisioned idea of this paper 
aims to cover the area of multi-model database maintenance in a 
dynamic environment, focusing on a combination of database recon- 
figuration, redesign, and query rewriting utilising both rule-based 
and model-based approaches. In addition, we target a universally
applicable multi-model generalisation of the single-model DBMSs
considered so far.


3 RESEARCH CHALLENGES AND 
ENVISIONED SOLUTIONS 


The optimal database design in the context of multi-model data 
and its maintenance under changing requirements and conditions 
is a challenging aim. In the case of a simple use case, we can rely 
on a skilled human database administrator (DBA); however, the 
multi-model tasks are, in essence, complex, especially when dealing 
with Big Data, and thus hardly manageable manually. By extending 
and integrating basic research solutions, a robust framework can be 
designed, capable of (1) accepting different levels of user interaction 
as well as different types of input data, queries, and changes, (2) 
supporting a wide range of propagation and modification strate- 
gies reflecting specific multi-model features, and (3) ensuring the 
preservation of adequate and efficient data access. However, there 
are critical features that need to be taken into account to achieve a 
truly robust solution: 


(1) Universal Applicability and Portability: The proposed approaches
must be applicable to any multi-model DBMS (or polystore), i.e., any
combination of existing popular models.

(2) Wide Range of Use Cases: The proposed solution should cover a wide
range of real-world use cases. For this purpose, it is necessary to
utilise a combination of rule-based and model-based strategies based
on the optional initial user settings and decisions, as well as
approaches to extracting the knowledge purely from the variable input
data, queries etc.

(3) Practical Impact: All the proposed algorithms must still preserve
a tight relation to the existing systems so that they can be exploited
in real-world scenarios and implementations.


In this section, we discuss three levels of the adaptation process 
we have identified that simultaneously form gradual extensions of 
the envisioned self-adapting framework. 


3.1 Level I. User-Specified Changes and 
Rule-Based Adaptation 


In the rest of the text, we consider the following popular data 
models: aggregate-oriented key/value, document, and wide-column, 
together with aggregate-ignorant relational, array, and graph. At the 
logical level, the transition between two models can be expressed 
either via (1) inter-model references or by (2) embedding one model 
into another (such as, e.g., columns of type JSONB in relational 
tables of PostgreSQL). Another possible combination of models is 
via (3) cross-model redundancy, i.e., storing the same data fragment 
in multiple models. 
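
The three ways of combining models can be pictured with plain Python structures standing in for the individual stores. The data and helper names below are hypothetical and only mirror the e-shop scenario; they do not correspond to the API of any particular DBMS.

```python
# (1) Inter-model reference: a document refers to a relational row by customer id.
customer_row = {"id": 1, "name": "Mary"}
order_doc = {"_id": {"customer": 1, "number": 2}, "items": [{"id": "B1", "quantity": 2}]}

# (2) Embedding: a document stored inside a relational column
#     (comparable to a JSONB column in a PostgreSQL table).
customer_row_embedded = {"id": 1, "name": "Mary", "contact": {"email": "mary@smith.cz"}}

# (3) Cross-model redundancy: the same fragment kept in a key/value store
#     and in a document.
cart_kv = {1: {"product": "T1", "quantity": 2}}
cart_doc = {"customer": 1, "cart": {"product": "T1", "quantity": 2}}


def resolve_reference(doc, rows):
    """Follow the inter-model reference from an order document to its customer row."""
    return next(row for row in rows if row["id"] == doc["_id"]["customer"])


def redundancy_consistent(kv, doc):
    """Redundant copies must agree; a mismatch means a change was not propagated."""
    return kv.get(doc["customer"]) == doc["cart"]
```

Each option shifts the maintenance burden differently: references must be kept valid, embedded values change together with their host record, and redundant copies must be updated consistently.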

To "grasp" the specifics of various data models in a unified way
and propose a solution that is not system-specific but can be easily
transferred to another system, we need a more abstract unified
representation of multi-model data. Currently, there exist several
proposals, such as the NoSQL AbstractModel [3] for aggregate-oriented
databases or the unified abstract representation of multi-model
data based on category theory [22, 23]. The latter enables
the view of multi-model data as a categorical graph mapped to a 
particular combination of models. The graph can also be queried 
using a categorical graph query language mapped to a system- 
specific cross-model query language. This abstract multi-model 
schema can either be created manually [25] or inferred from input 
data [24]. Moreover, unifying categorical approaches are proven to 
be suitable for applications in, e.g., machine learning [46]. 

In this first level, we can assume that the user creates a basic
multi-model system-specific database schema and transforms it
to its abstract unified schema over which changes representing
new requirements are then specified². As the first step, we need
to ensure the necessary core functionality, i.e., the correct and
complete propagation of these changes to all affected parts of the
system (i.e., logical data structures and data instances), including
the respective operations and related structures (i.e., queries and
views, indices etc.) [4]. The rule-based approach must consider the
specifics of all models, all types of inter-model transitions, the
transformation of data during cross-model migration etc. Moreover, it
must identify cases when multiple options are possible and user input
or additional information (e.g., queries, data statistics etc.) is
needed. These cases are to be decided at further levels using AI
approaches.

Currently, there exist several papers dealing with rule-based
evolution management of single-model systems, such as XML [38,
43] or relational [6], and REST APIs [44]. There also exist
approaches dealing with closely related aggregate-oriented models [16,
49]. In the case of multi-model databases, this task is more subtle
and difficult since, besides intra-model changes, we have to deal
with inter-model changes for which the single-model approaches
cannot be directly re-used, together with cross-model redundancy, 
cross-model integrity constraints etc. Apart from the recent prelim- 
inary academic prototype MM-evolver [53], there are, in principle, 
no tools supporting schema evolution in multi-model databases in 
its full complexity. 

To sum up, the first level of an adaptable multi-model framework 
needs an abstract representation of any combination of popular 
models. A minimal and complete set of schema modification operations
(SMOs), both basic and composite, and their precise semantics
must then be defined, together with an algorithm for efficient
propagation to data instances. A crucial aspect of multi-model evolution
management [18] is the propagation of changes to all types of inter- 
model transitions. The propagation strategies should also consider 
multi-model queries. And last but not least, a set of cost functions 
needs to be proposed reflecting the complexity and efficiency of 
each transformation regarding the given use case, i.e., data statistics, 
query workload, storage strategies, selected DBMS-specific features 
etc. 
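
A minimal sketch of how a basic SMO, its propagation to data instances, and a cost function might fit together is given below. The class, its methods, and the cost model (one rewrite per record) are hypothetical simplifications introduced here; they are not the SMO set or semantics envisioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenameProperty:
    """A basic SMO: rename one property of a kind (table, collection, ...)."""
    kind: str
    old: str
    new: str

    def apply_to_schema(self, schema):
        """Return a new schema with the property renamed; the input is not modified."""
        renamed = [self.new if p == self.old else p for p in schema[self.kind]]
        return {**schema, self.kind: renamed}

    def apply_to_instances(self, records):
        """Propagate the rename to the data instances of the kind."""
        return [
            {(self.new if key == self.old else key): value for key, value in record.items()}
            for record in records
        ]

    def cost(self, record_count):
        """Hypothetical cost: every record of the kind is rewritten once."""
        return record_count


def apply_composite(smos, schema):
    """A composite SMO, modelled here simply as a sequence of basic SMOs."""
    for smo in smos:
        schema = smo.apply_to_schema(schema)
    return schema
```

In a multi-model setting the same operation would additionally have to be propagated across inter-model references, embedded structures, and redundant copies, and its cost would depend on the chosen strategy.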


² If specified over the logical schema of the selected multi-model database, they can be
propagated to the abstract schema.


3.2 Level II. Data-Driven Changes and
Learning-Based Adaptation


In the second level, we consider the case when the user performs
the initial setting of the system and then lets the system
continue (semi-)autonomously. So, in this new context, the user
does not specify the changes explicitly using SMOs. They need to
be extracted from the changing data and propagated to the unifying
model, where all affected parts of the system can be identified. Since
there is no direct user input or feedback, AI approaches need to be
utilised to decide the ambiguous cases in change identification and
propagation strategy.

Another source of changes that needs to be considered is changes
in queries. Such a change may need to be propagated to storage
strategies to preserve/increase the efficiency of query evaluation. 
Besides traditional single-model strategies, such as modification of 
indices, the support for redundancy in distributed DBMSs, includ- 
ing, e.g., (cross-model) materialised views or cross-model redun- 
dancy, can be exploited. 

Identifying a change from input data can be viewed as a par- 
ticular case of dynamic schema inference. Even in schema-less 
databases, an implicit schema, i.e., a structure of the data expected 
by the application, exists. Hence, the idea of schemalessness is 
often instead characterised as schema-on-demand needed when 
the data is to be processed. However, most existing approaches as- 
sume a static input data set and focus only on single-model schema 
inference. A large set of schema inference approaches, both heuris- 
tic [37] and grammar-inferring [2], can be found for XML data, 
which is accompanied by standard schema definition languages 
DTD and XML Schema. There also exist papers that focus on in- 
ference of integrity constraints [54] or Schematron schemas [26]. 
With the dawn of NoSQL databases and the related popularity of 
the JSON format, there also appeared approaches inferring (Big) 
JSON data [1, 12] or general approaches for aggregate-oriented 
databases [5, 45]. The first and, to the best of our knowledge, also 
the only result focusing on multi-model schema-inference is the 
scalable academic prototype called MM-infer [24]. 
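
The notion of an implicit schema can be made concrete with a small single-model sketch that infers, from a set of JSON-like documents, the observed types of each top-level property and whether the property is always present. It is a simplified illustration under the assumption of flat documents; the function name is hypothetical and the sketch is unrelated to the implementation of the cited tools.

```python
def infer_schema(documents):
    """Map each top-level property to its observed type names and a 'required' flag."""
    observed = {}
    for doc in documents:
        for key, value in doc.items():
            entry = observed.setdefault(key, {"types": set(), "count": 0})
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
    total = len(documents)
    return {
        key: {"types": sorted(entry["types"]), "required": entry["count"] == total}
        for key, entry in observed.items()
    }


# Hypothetical documents: "email" occurs only in some of them.
inferred = infer_schema([
    {"id": 1, "name": "Mary"},
    {"id": 2, "name": "Anne", "email": "anne@example.org"},
])
```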

Optimising query evaluation using various AI approaches has 
been the focus of researchers for many years. Especially with the 
arrival of Big Data and deep learning approaches, a new wave of pro- 
posals has occurred. There exist proposals for, e.g., learned indices 
which reflect the actual distribution of data [21, 27]; join order se- 
lection based on reinforcement learning [36]; estimation of benefits 
of materialized views using deep reinforcement learning [15, 30]; 
or autonomous tuning of database knobs [52]. However, most ap- 
proaches solve only the single-model case or are system-specific. 

First, a multi-model schema inference approach capable of in- 
ferring a schema dynamically, i.e., for a changing data set, needs 
to be designed. The process must identify the changes/extensions 
of the data structures, including alternatives, and map them to the 
SMOs proposed in the first level. Second, we need to extend the 
propagation strategies and the respective costs with further trans- 
formations influencing query evaluation, such as modification of 
indices, materialised views, cross-model redundancy etc. Finally, 
having the set of SMOs and their alternatives, together with a set 
of respective propagation strategies and their costs, the framework 
can be extended toward autonomous decisions by exploiting super- 
vised learning techniques to mimic the decisions of a human user 
in the multi-model context. 

Considering the query performance, a closely related approach 
is represented by autonomous DBMSs expected to automatically 
and constantly tune themselves by adapting to data and workload
changes. There are three major topics on workload-aware performance
tuning for an autonomous DBMS: (1) workload and data
classification, (2) workload and data forecasting, and (3) system 
tuning [56]. The goal is to enable the multi-model DBMSs to con- 
tinuously and automatically adjust databases’ configurations by 
analyzing the evolving multi-model workload and making optimal 
decisions for changing workload types and data. The corresponding 
tuning tasks consist of two main aspects: (1) multi-model database 
design and (2) resource provisioning. In the former one, based on 
the multi-model workload changes, the databases need to evolve 
the physical design, such as indexes, materialized views, partition- 
ing, and storage, to achieve optimal performance. Sometimes the 
databases also have to re-design the multi-model schema according 
to the workload information. In the latter case of resource provision- 
ing, the multi-model database must estimate the needed hardware 
resources to support a new workload not yet deployed in a pro- 
duction environment, including CPU, RAM, Disk I/O, buffer pool 
size, page size, etc. As we can see, the latter aspect is tightly bound 
with a particular DBMS and its specifics. Since we aim to provide 
a universal approach, another open problem is to find the balance 
between universal applicability and maximization of efficiency. 


3.3 Level III. Advanced Self-Adapting Evolution 
Management 


The previous two levels ensure that the DBMS is entirely and cor- 
rectly modified to ensure the same functionality and at least the 
same efficiency of query evaluation within the dynamic environ- 
ment. The last considered level will focus on advanced and more 
complex use cases (i.e., inputs) and their solutions (i.e., propagation
strategies) to reach complete self-adapting robustness. 

First, an important change to consider regarding the input is 
when there is no initial user-specified database schema, i.e., we 
only have the input data and queries to be efficiently evaluated. In 
other words, we approach the concept of a data lake. Next, we
have to consider that some of the changes detected in the data 
may represent errors/outliers. The occurrence of syntactically, se- 
mantically, or statistically anomalous data needs to be detected to 
avoid their complex but unsolicited propagation, i.e., significant
unwanted changes in the system. 

Second, regarding the change propagation, we have so far consid- 
ered only the eager strategy, i.e., immediate modifications. However, 
as depicted in [20] for aggregate-oriented DBMSs, the lazy or pro- 
active strategies (i.e., based on the history of changes, they predict 
probable near future changing parts) may bring significant benefits 
for selected use cases. So, the search space of options is much more 
complex. 

Currently, there exist approaches that directly learn database
schema design (for single-model systems), e.g., an approach designing
a relational schema using deep reinforcement learning [17].
Similarly, the data structure alchemy [19] defines the design space for
a key/value store using database knob tuning. Considering the
multi-model related work again, probably the first and only related
solution that (lazily, eagerly, or pro-actively) reflects the changes in
user requirements has been proposed and evaluated for
aggregate-oriented DBMSs within the systems Darwin and MigCast [20].
Another academic project, OctopusDB, can automatically evolve its
storage and execution architecture over time based on the applica- 
tion’s workload. However, it does not target multiple models but 
multiple systems, namely OLTP, OLAP, streaming systems, and 
scan-oriented DBMSs. 

The problem of error detection is addressed in many areas, such
as data curation [48] and data discovery [51], i.e., the process of
discovering data that are relevant for specific tasks, where the data 
first need to be cleaned. However, most data-curation solutions can- 
not be easily fully automated, as they are often ad-hoc and require 
substantial human effort. Hence, using AI techniques learning from 
the history of a dynamic system seems to be a natural approach 
and extension of the proposed framework. 

The described broadening of both input and respective propaga- 
tion of changes will enable reaching the solution’s target robustness. 
First, based on multi-model schema inference approaches, a strat- 
egy inferring the logical multi-model database schema should be 
proposed. It will be designed concerning the given query workload 
and partially utilising the costs of propagation strategies. The prop- 
agation strategies and their cost evaluation can be extended with 
advanced multi-model features, such as modification of multi-model 
indices, inter-model data migration, cross-model materialised views, 
cross-model redundancy etc. In addition, the eager propagation of 
changes can be extended with lazy and pro-active options, as in- 
spired by data migration systems Darwin and MigCast [20] or the 
autonomous self-driving DBMSs [55, 57] and the appropriateness 
of their application experimentally evaluated. The approach for 
detecting changes needs to be extended to detect errors/outliers in 
the data. AI techniques such as anomaly detection can be utilised 
for this purpose. Finally, based on the formal model of the sys- 
tem, possible changes, and their cost, a fully automated approach 
to updating the database schema can be proposed by exploiting 
automated planning techniques. 


3.4 Summary 


To sum up the ideas, in Fig. 3 we depict the process of gradual 
extension of the framework, i.e., its three levels. At the same time, 
it represents several use cases that we cover. As we can see, user 
involvement (red) in the design and maintenance of the multi-model
database schema (green) decreases with the growing level. The ini- 
tial setting and the SMOs are gradually replaced by providing only 
input multi-model data to be managed and respective cross-model 
queries. The framework (yellow) becomes more sophisticated (de- 
picted using the blue color) as it adopts AI approaches to mimic the 
user’s decisions. It can also make more complex decisions in terms 
of processing of the input (e.g., detection of outliers), supported 
transformations (e.g., cross-model data migration), or propagation 
strategies (i.e., lazy or pro-active) that exceed the abilities of a hu- 
man DBA, especially considering the world of multi-model Big 
Data.

Figure 3: Levels and use cases of the self-adapting framework


4 CONCLUSION 


This paper aimed to envision a self-adapting evolution manage- 
ment framework that can eventually maintain and design an op- 
timal multi-model database schema and related structures. Since 
such a framework is a complex tool, we propose three levels that 
correspond to three steps, gradually extending a necessary basic 
functionality towards the robust target. In other words, we consider 
various real-world use cases. First, the user-defined changes and 
rule-based adaptation ensure the core functionality. It corresponds 
to the initial situation when the particular use case is not that com- 
plex and still manageable by a skilled DBA. Next, we assume the 
case when the user input is no longer available. The changes need 
to be extracted from the changing data, and the choice of the opti- 
mal propagation strategy needs to be decided using AI strategies. 
Finally, we consider the case when the user input is not available 
from the beginning, and there can be errors/outliers in the input 
data to be detected. Also, the propagation of changes can be smart, 
i.e., lazy or proactive, aiming to minimise the impact of the changes.

The proposed levels and their parts can be solved separately, 
reflecting the needs of a subset of use cases. Also, in many aspects, 
several single-model or system-specific approaches exist that can 
serve as a verified basis. However, the multi-model environment 
requires consideration of more complex situations and their more 
complex solutions.


ABSTRACT


PM2.5 is pollutant particulate matter with a diameter of less than 2.5
micrometers. Many stations have been installed around the world to
measure its concentration. Areas without a station or proper
equipment must rely on interpolation techniques to approximate the
concentration of the pollutant. A faster and more accurate
interpolation technique can help identify more polluted areas and
thus support efficient measures to reduce the harmful effects of
PM2.5. We explored three different neural networks, i.e.,
Bidirectional-Long Short-Term Memory (Bi-LSTM), Gated Recurrent Unit
(GRU), and Temporal Convolutional Neural Networks (TCN), to
interpolate the PM2.5 concentration over the southeast region of the
U.S. We investigate different data preprocessing techniques and the
effects of spatiotemporal correlation on the models. We finally
compare these models and choose the model that is most appropriate
for PM2.5 interpolation.


1 INTRODUCTION


PM2.5 is particulate matter with a diameter smaller than 2.5 μm.
Indoor PM2.5 particles originate from indoor burning activities
such as smoking, cooking, operating fireplaces, and fuel-burning
heaters [24]. Outdoor PM2.5 particles are the most common.
They originate from all kinds of burning activities in the
environment such as vehicles and machines with fuel-based engines,
industries, wildfires, and even volcanic eruptions [6]. Most particles
are formed in the atmosphere as a result of complex reactions of
chemicals such as sulfur dioxide and nitrogen oxides, which are
pollutants emitted from power plants, industries, and automobiles
[6]. They have been studied and shown to be responsible for many
cardiovascular and respiratory problems such as irregular heartbeat,
aggravated asthma, decreased lung function, and increased respiratory
symptoms such as irritation of the airways, coughing, or difficulty
breathing [6, 24, 26].

To monitor and reduce PM2.5 pollution, many countries have
installed monitoring stations. The obtained data can later be
analyzed to inform decisions against PM2.5 pollution. The
United States Environmental Protection Agency (EPA) has had
National Ambient Air Quality Standards for PM2.5 in place since 1990 [1].
Unfortunately, many polluted places are not equipped with moni- 
toring stations. Sometimes, places with monitoring stations don’t 
record any data for months due to outage or technical issues. To 
estimate the missing or non-recorded data at some spatial locations 
for a specific time, effective interpolation techniques are needed. 

Spatial interpolations can be grouped into two types, i.e., point
interpolation (based on field data) and area interpolation (based 
on entity data) [18]. Point interpolations can be further divided 
into two sub-parts which are the exact point interpolation and 
approximate point interpolation. The most popular exact point 
interpolation techniques are Inverse Distance Weighting, kriging 
and shape functions [11, 12]. Spatiotemporal methods incorporate 
time and space simultaneously by using the known spatiotemporal 
measurements in the interpolation algorithm [13, 14, 21]. 

Qiao et al. [19] applied a hybrid of wavelet transform, stacked 
autoencoder and LSTM to predict PM2.5. Bamane et al. [3] used 
linear interpolation, Spearman's rank-order correlation, LSTM,
stacked LSTM, and Bi-LSTM to estimate PM2.5 concentration. Huang
et al. [7] proposed an ensemble of Convolutional Neural Network
(CNN) and LSTM to estimate the concentration of PM2.5. These
models fail to capture the spatiotemporal correlation in the dataset.
Tong et al. [25] proposed a spatiotemporal technique that takes
into account the relationship between each monitoring site and its
k-nearest neighbors and also between each monitoring site and its
own measurements on different days.

We conduct a comparative study of three different neural networks,
i.e., Bidirectional-Long Short-Term Memory (Bi-LSTM), Gated
Recurrent Unit (GRU), Temporal Convolutional Neural Networks 
(TCN), to estimate the concentration of PM2.5 over the southeast re- 
gion of the US. We explore multiple data preprocessing techniques 
and the effects of spatiotemporal correlation on the models. 

In the sequel, Section 2 introduces Bi-LSTM, GRU, and TCN 
in detail. Section 3 describes the dataset and the preprocessing 
techniques. Section 4 shows our experimental framework and setup. 
Section 5 gives the results of our experiments. We conclude in 
Section 6. 


2 METHODS 


A recurrent neural network (RNN) is a neural network with a
feedback (recurrence) to itself. While a non-recurrent unit cannot
receive inputs from its previous state (timestep), a recurrent unit
receives inputs from its previous timestep and outputs feedback
data for its next timestep. An RNN can encode dependencies between
inputs but has a problem when handling long data sequences. When
encoding long data dependencies, the backpropagation of the signal
goes through multiple layers of the network. As the signal travels
through more layers using certain activation functions, the
gradients of the loss function either grow exponentially or vanish,
making the neural network unable to learn. These are called the
exploding and vanishing gradient problems, respectively. The
simplest solutions are to use the right activation functions or to
perform batch normalization. Batch normalization is a technique
that standardizes the inputs of a layer. It is believed to make the
training of the neural network more stable and faster.
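
As a rough numerical illustration of the gradient problem described above, the following sketch multiplies the per-step backpropagation factor along a chain of timesteps. It is a simplification under stated assumptions: a scalar recurrence with a hypothetical recurrent weight `w`, a sigmoid activation, and the same pre-activation at every step. The function names are illustrative and not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum value is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def backprop_factor(num_steps, w=1.0, pre_activation=0.0):
    """Product of the per-step factors w * sigmoid'(a) over num_steps timesteps."""
    factor = 1.0
    for _ in range(num_steps):
        factor *= w * sigmoid_grad(pre_activation)
    return factor

for steps in (1, 5, 10, 20):
    # With w = 1 the factor shrinks as 0.25 ** steps (vanishing gradient);
    # with w = 8 it grows as 2 ** steps (exploding gradient).
    print(steps, backprop_factor(steps), backprop_factor(steps, w=8.0))
```

With a small recurrent weight the factor decays geometrically, while a large weight makes it grow geometrically, which mirrors the two failure modes mentioned in the text.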

Long short-term memory (LSTM) networks are built to overcome the
problems of RNNs with long-term dependencies [23]. An LSTM unit
comprises different gates: an input gate, a forget gate, and an
output gate [16]. The cell state and the gates are the core concepts
of LSTM. The cell state serves as the memory of the network: it
makes it possible to store relevant information and transport it
along the sequence chain throughout the processing of the sequence.
Information from past steps thus remains available to later steps,
which reduces the effects of short-term memory [7]. The cell state
receives more information throughout its journey, and some
information is also removed via the gates. The gates decide what to
retain and what to forget in this journey.
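
The interplay of the cell state and the three gates can be sketched for a single scalar unit as follows. This is a minimal illustration assuming the commonly used LSTM formulation. The parameter dictionary `p`, its keys, and the function name `lstm_step` are hypothetical and not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps a name to (input weight, recurrent weight, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # share of the old cell state that is kept
    i = sigmoid(pre("input"))        # share of the new candidate that is written
    o = sigmoid(pre("output"))       # share of the cell state that is exposed
    g = math.tanh(pre("candidate"))  # candidate content computed from x and h_prev
    c = f * c_prev + i * g           # updated cell state (the memory)
    h = o * math.tanh(c)             # new hidden state passed to the next step
    return h, c

# Example: a saturated forget gate and a closed input gate keep the memory intact.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
print(h, c)
```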

The Gated Recurrent Unit (GRU) is a recurrent neural network similar
to LSTM, but it lacks an output gate and has fewer parameters, and it
likewise aims to address the vanishing gradient problem [16].
A GRU has only hidden states and only two gates, a reset gate and
an update gate. The gates regulate the flow of information, which
allows the GRU to mitigate the vanishing gradient problem of a
standard RNN [4]. The update and reset gates are vectors deciding
which information goes through to the output [7]. Those gates store
and filter the information. The update gate determines how much of
the information from past steps is carried to the future. This
means that it can copy relevant information from the past and
reduce the risk of a vanishing gradient. The reset gate decides how
much of the past data is irrelevant and worth forgetting [20]. Both
the update and reset gates use the same formula; they differ only in
their weights and in how the gate is used. The gates affect the
final output of the model. The last step in the unit computes a
vector that transfers information from the current unit to the
network. The update gate is needed for this transfer because it
determines what to collect from the current memory content [7]. The
GRU alleviates the vanishing gradient problem because it does not
wash out the new input at every step but rather stores the relevant
information and passes it down to later steps of the network. Having
fewer operations also allows a GRU to be trained faster than an
LSTM.
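
The statement that both gates share one formula and differ only in weights and usage can be made concrete with a scalar sketch. It assumes the common GRU formulation in which the new state is `(1 - z) * h_prev + z * h_cand`. Some libraries swap the roles of `z` and `1 - z`. The names `gru_step` and `p` are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p maps a name to (input weight, recurrent weight, bias)."""
    def gate(name):
        # Same formula for both gates; only the weights differ.
        w_x, w_h, b = p[name]
        return sigmoid(w_x * x + w_h * h_prev + b)

    z = gate("update")  # how much of the new candidate enters the state
    r = gate("reset")   # how much of the past state enters the candidate
    w_x, w_h, b = p["candidate"]
    h_cand = math.tanh(w_x * x + w_h * (r * h_prev) + b)
    return (1.0 - z) * h_prev + z * h_cand

keep_past = {"update": (0.0, 0.0, -20.0), "reset": (0.0, 0.0, 0.0), "candidate": (1.0, 1.0, 0.0)}
take_new = {"update": (0.0, 0.0, 20.0), "reset": (0.0, 0.0, 0.0), "candidate": (1.0, 1.0, 0.0)}
print(gru_step(1.0, 0.3, keep_past))  # stays close to the previous state 0.3
print(gru_step(1.0, 0.3, take_new))   # moves close to the candidate tanh(1.15)
```

When the update gate is nearly closed, the previous state is copied almost unchanged, which is the mechanism the text credits for carrying past information forward.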

A bidirectional LSTM (Bi-LSTM) results from combining Long
Short-Term Memory (LSTM) with Bi-directional Recurrent Networks
(Bi-RNN). This structure makes it possible for networks to access
backward and forward information on sequences at every level. The
Bi-RNN can handle input information from both the back and the
front: the bidirectional structure processes the input twice, once
from the future to the past and once from the past to the future.
Recent years have seen an increase in approaches combining
recommendation systems and deep learning [7]. Merging the LSTM and
Bi-RNN brings together the storage capability of the LSTM cell
memory and the information access abilities of the Bi-RNN, hence
making the Bi-LSTM better [23]. The ability of Bi-LSTMs to handle
data with long-range dependence allows them to improve performance
on sequence classification problems [4].

A Convolutional Neural Network (CNN) is similar to a feedforward
neural network (FNN), with the difference that a CNN has one or more
convolutional layers [2]. A convolutional layer is similar to a hidden
layer of an FNN, but it uses one or more filters. Filters have input
weights and generate an output neuron. The goal of the
convolutional layer is to extract features, mostly from an input
image, and preserve the spatial relationship by learning features
using small squares of input data [17]. CNNs have been employed,
e.g., to learn the correlation between image and sentence [22].
However, a variation of CNN called Temporal Convolutional Network
(TCN) has been developed for sequence modelling tasks.

Temporal Convolutional Neural Networks (TCN) consist of dilated,
causal 1D (one-dimensional) convolutional layers, which convolve
each output only with the current and past elements of a sequence.
This makes TCNs effective in sequence prediction, hence their
utilization in weather prediction [15]. A 1D convolutional network
takes a 3-dimensional tensor as input and also outputs a
3-dimensional tensor. A single 1D convolutional layer receives an
input tensor and outputs a tensor; to ensure that the output tensor
has the same length as the input tensor, zero padding can be
applied [9]. In a forecasting model, the value of a specific entry
in the output depends on all previous entries in the input. This is
made possible when the receptive field has the same size as the
input length. Most convolutional hidden layers end with a pooling
layer whose job is to distill the output of the last convolutional
layer to the most important elements. Temporal convolutional
networks do not have time steps. They treat the temporal data as a
sequence over which convolutional read operations can be performed.
TCNs are combined with RNNs in video-based action segmentation:
low-level features, which encode spatial-temporal information, are
filtered and then fed into a classifier that captures high-level
temporal information using an RNN. At least two convolutional
layers are needed for video-based segmentation [10]. The TCN
approach for capturing two levels of information hierarchically is
known as the encoder-decoder framework [8].
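
The notions of causal convolution, dilation, zero padding, and receptive field can be illustrated with a small pure-Python sketch. It assumes a single channel, and that `kernel[0]` weights the current element while `kernel[j]` weights the element `j * dilation` steps in the past. The function names are illustrative and are not the API of any TCN library.

```python
def causal_dilated_conv1d(sequence, kernel, dilation=1):
    """Causal 1D convolution with implicit left zero padding (output length = input length)."""
    out = []
    for t in range(len(sequence)):
        total = 0.0
        for j, weight in enumerate(kernel):
            idx = t - j * dilation  # only the current and past positions are read
            if idx >= 0:            # positions before the start count as zeros
                total += weight * sequence[idx]
        out.append(total)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input steps that influence one output after stacking the given layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

print(causal_dilated_conv1d([1, 2, 3, 4], [1, 1], dilation=2))  # [1.0, 2.0, 4.0, 6.0]
print(receptive_field(kernel_size=2, dilations=[1, 2, 4]))      # 8
```

Doubling the dilation in each stacked layer lets the receptive field grow quickly, which is how a network of this kind can cover an input of a given length with few layers.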


3 EXPERIMENTAL DATASET 


The dataset consists of daily PM2.5 measurements in Florida in 2009.
There are 19,475 rows of data, which were collected from 53 U.S.
EPA Air Quality System (AQS) monitoring sites. Each row consists
of 5 columns, i.e., site id, timestamp, latitude, longitude, and
daily PM2.5 concentration measurement. Sample raw data is shown in
Figure 1. The time range for this dataset is between January 1st,
2009 and December 31st, 2009.


       id         year_month_day  longitude   latitude   pm25
0      120010023  12/30/2009      -82.387778  29.706111  7.3
1      120010023  12/27/2009      -82.387778  29.706111  6.5
2      120010023  12/27/2009      -82.387778  29.706111  6.6
3      120010023  12/24/2009      -82.387778  29.706111  5.5
4      120010023  12/18/2009      -82.387778  29.706111  2.6
...    ...        ...             ...         ...        ...
19469  121290001  1/16/2009       -84.161111  30.092500  …
19470  121290001  1/4/2009        -84.161111  30.092500  …
19471  121290001  1/4/2009        -84.161111  30.092500  …
19472  121290001  1/1/2009        -84.161111  30.092500  …
19473  121290001  1/1/2009        -84.161111  30.092500  …

[19474 rows x 5 columns]


Figure 1: Original Data


We observe that some sites have duplicate daily records
and some sites lack records for some days. The dataset is
preprocessed so that the models can better capture the spatial and
temporal relations.

We first drop the duplicates, then group the dataset by id and
keep the 32 ids with the most daily measurements. After that,
we calculate the mean of the PM2.5 concentrations at each site. We
then group the data by site id and iterate over each group to
compare its dates to the 365 days of the year. If a certain day
is not present in the group of a given id, we create a row for the
missing day and complete the remaining columns, using the site mean
as the PM2.5 value. This function is used to fill the missing days
for the whole dataset. We assume that the PM2.5 concentration at
one site is related to the PM2.5 concentrations at neighboring
sites. Thus, a k-d tree-based k-nearest neighbor algorithm is
applied to find the k nearest neighbors of each site, and we then
reshape the dataset by treating these neighbors' features as new
features for the current site. In our experiments, we use
k ∈ {1, 2, 3, 4, 5, 6}. Once the
preprocessing is complete, we have 6 datasets, one for each k.
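
The two central preprocessing steps, filling missing days with the site mean and finding the k nearest sites, can be sketched as follows. This is an illustration under assumptions, not the authors' code: the data layout (dictionaries keyed by site id) and the function names are hypothetical, the neighbor search is a brute-force scan rather than a k-d tree, and distances are plain Euclidean distances on (longitude, latitude) pairs.

```python
from datetime import date, timedelta
from math import hypot

def fill_missing_days(records, year=2009):
    """records: {site_id: {date: pm25}}. Returns the mapping with every day of the year present."""
    first = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - first).days
    filled = {}
    for site, daily in records.items():
        site_mean = sum(daily.values()) / len(daily)
        filled[site] = {}
        for i in range(n_days):
            day = first + timedelta(days=i)
            # Missing days receive the mean PM2.5 value of the site.
            filled[site][day] = daily.get(day, site_mean)
    return filled

def k_nearest_sites(coords, k):
    """coords: {site_id: (longitude, latitude)}. Returns {site_id: [ids of the k nearest other sites]}."""
    neighbors = {}
    for site, (x, y) in coords.items():
        others = sorted(
            (s for s in coords if s != site),
            key=lambda s: hypot(coords[s][0] - x, coords[s][1] - y),
        )
        neighbors[site] = others[:k]
    return neighbors

records = {"A": {date(2009, 1, 1): 4.0, date(2009, 1, 3): 6.0}}
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}
print(len(fill_missing_days(records)["A"]))  # 365
print(k_nearest_sites(coords, k=1))
```

A k-d tree returns the same neighbors as the brute-force scan used here and becomes preferable when the number of sites is large.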


4 EXPERIMENTS 


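Four performance measures are used. They are the mean absolute
error (MAE), the root-mean-square error (RMSE), the mean absolute
percentage error (MAPE), and the mean-square error (MSE), whose
computations are as follows:

\[
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |O_i - P_i|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (O_i - P_i)^2},
\]
\[
\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N} \left|\frac{O_i - P_i}{O_i}\right|, \qquad
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (O_i - P_i)^2,
\]

where
N = number of evaluation samples,
O_i = observed concentration of PM2.5 particles,
P_i = predicted concentration of PM2.5 particles.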
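
The four error measures named above can be sketched in Python as follows. The sketch assumes the standard definitions, with MAPE expressed in percent and all observed values non-zero. The function name `error_metrics` is illustrative.

```python
import math

def error_metrics(observed, predicted):
    """Return MAE, RMSE, MAPE (in percent) and MSE for paired observations and predictions."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the observed value, so zero observations must be excluded beforehand.
    mape = 100.0 / n * sum(abs(e / o) for e, o in zip(errors, observed))
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MSE": mse}

print(error_metrics([10.0, 20.0], [8.0, 25.0]))
```

Since RMSE is the square root of MSE, the two measures always rank models identically; MAPE differs in that it weights each error relative to the observed concentration.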
Figure 2: The Bi-LSTM, GRU and TCN networks


We design one deep Bi-LSTM and one GRU recurrent neural
network (Figure 2) following mostly the same parameters. Each of
the two recurrent networks consists of 2 stacked recurrent layers
followed by one dense layer. We initialize the kernel weights
randomly from a uniform distribution for both the Bi-LSTM and GRU
models. We apply the sigmoid activation function
σ(x) = 1 / (1 + e^(-x)), and we use MAPE as the loss function and
the Adam algorithm as the optimizer.
The TCN was built with one and later with two deep convolutional
layers (Figure 2). Each convolutional layer has a kernel size
and a dilation rate that vary for each dataset. A max-pooling layer
is added after the second convolutional layer to help reduce some
feature dependencies. After the max-pooling layer, the outputs are
flattened and passed into a dense layer with one neuron as our
output.


[Flowchart. Preprocessing: raw dataset (dataset 1, dataset 2);
calculate nearest neighbors; data normalization and scaling between
0 and 1. Splitting: split data into train and test sets; reshape the
training and test sets into 3D X and 2D y. Build-Train Models:
Bi-LSTM, GRU, TCN. Prediction: MAPE, RMSE, MSE, MAE.]


Figure 3: Flowchart of the Model


While training the neural networks, some of the models were
overfitting, as the performance on the training set was better than
on the validation set. To address this, we tested multiple
techniques such as reducing the number of units, adding dropout,
and using the early stopping method, which consists of terminating
the training when the error reaches a certain threshold.
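
The threshold-based early stopping rule described above can be sketched as a small helper. It assumes that the monitored quantity is the validation error recorded once per epoch and that reaching the threshold means falling to or below it. The function name and both assumptions are illustrative and are not details given in the paper.

```python
def stopping_epoch(validation_errors, threshold):
    """Return the 1-based epoch at which training stops, or None if the threshold is never reached."""
    for epoch, error in enumerate(validation_errors, start=1):
        if error <= threshold:
            return epoch
    return None

print(stopping_epoch([30.0, 25.0, 22.4, 22.1], threshold=22.5))  # 3
```

A frequently used alternative stops when the validation error has not improved for a fixed number of epochs. That variant does not require choosing an absolute threshold in advance.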

The framework of our model is shown in Figure 3. When implementing
a neural network, it is important to normalize and scale the data
to avoid overfitting. In our case, it is even more important to
normalize and scale the data between 0 and 1 because we are working
with outliers and the sigmoid activation function.

We set up two types of experiments. The first type of experiment
tests whether a neural network improves with the number of
neighbors and the number of influencing days. For all neural
networks implemented, each dataset was divided into three sets: 81%
for training, 9% for validation, and 10% for testing. The second
type of experiment compares the three neural networks in our
model.
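
The scaling, splitting, and reshaping steps mentioned in this section can be sketched as follows. Several points are assumptions rather than details from the paper: the split is taken in order without shuffling, the 81/9/10 proportions are obtained by holding out 10% for testing and then 10% of the remainder for validation, and each 3D input sample holds `t` consecutive values of a single feature with the following value as its target. All function names are illustrative.

```python
def min_max_scale(values):
    """Scale values linearly to the range [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def split_81_9_10(samples):
    """Hold out 10% for testing, then 10% of the remainder (9% overall) for validation."""
    n_test = round(0.10 * len(samples))
    n_val = round(0.10 * (len(samples) - n_test))
    n_train = len(samples) - n_test - n_val
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

def make_windows(series, t):
    """Build X with shape (samples, t, 1) and y with shape (samples, 1) from a 1D series."""
    X = [[[v] for v in series[i:i + t]] for i in range(len(series) - t)]
    y = [[series[i + t]] for i in range(len(series) - t)]
    return X, y

train, val, test = split_81_9_10(list(range(100)))
print(len(train), len(val), len(test))  # 81 9 10
print(min_max_scale([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
print(make_windows([1, 2, 3, 4, 5], t=2))
```

In practice the scaling parameters would be computed on the training portion only and then applied to the validation and test portions, so that no information from held-out data leaks into training.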

All experiments are run on Windows 10 with the processor
“Intel(R) Core(TM) i7(R)-10510U CPU @ 1.80GHz”, 16 GB of DDR4
system memory, and the GPU “2 GB NVIDIA GeForce MX250”.


5 EXPERIMENT RESULTS 


A neural network is not trained on all the training data at
once. Instead, the data is divided into fixed-size batches, which are
fed sequentially to the layers. Small batch sizes have been reported
to make the training loss converge in fewer epochs, while bigger
batches can be processed in parallel and hence are computationally
efficient [5]. To choose the optimal batch size, we try several
batch sizes and select the one with the best MAPE.

During one epoch, all the batches are used exactly once to train
the neural network. If a neural network is trained for a small
number of epochs, it will not learn enough. So, we might think that
using more epochs to learn more is the solution. While it is
common sense to think that the more you train the better you
learn, that is not the case for neural networks. A neural network
that learns too much will overfit, meaning that it will learn to
perform perfectly on the training data and perform poorly on the
validation or testing data. So we must choose an optimal number
of epochs to avoid overfitting.

We first find the optimal batch size and the optimal
number of epochs for each neural network in our model. Then we
explore the influencing days and neighbors for each neural network
in our model.


• Bi-LSTM:
The batch table (upper part of Figure 4) shows the relation
between each batch size and the resulting MAPE. The lower part of
Figure 4 shows the result of training for different numbers of
epochs. We ran more tests, and some of our first-choice parameters
changed drastically from test to test. So we always pick
the three parameter values that lead to the three smallest MAPE.
After that, we repeat the same tests three more
times. The parameter value with the highest probability of yielding
a low MAPE is chosen. The final Bi-LSTM parameters that
we choose for training are the following:
— Bi-LSTM Units per layer = 5 
— Batch size = 16 
— Number of epoch = 32 
— Number of Bi-LSTM hidden layers = 2 
— Number of Dense layer = 1 
— Dropout = 0.0 
We wanted to know how the neural network performs for
different numbers of days. Figure 5 shows the number of days on the
horizontal axis and the resulting MAPE of the neural network on
the vertical axis. MAPE fluctuates but does not decrease
when the number of days increases from 1 to 6. It means
that a small change in the number of days may not be very
influential in the error calculation. In general, more days
lead to a worse prediction because our Bi-LSTM model is
a single prediction model which predicts one value for all
stations. Such a model makes the operations computationally
efficient, but it is subject to a known common error called
error accumulation, which increases the more days we try
to predict [27]. Figure 5 shows that when we use more nearest
neighbors, MAPE decreases. So, our Bi-LSTM can learn
from the correlation between neighboring sites. Figure
6 shows the relation between the number of days, the
number of neighbors, and the MAPE.


Batch Size | MAE      | RMSE     | MAPE     | MSE      | Ex_Time
4          | 1.47166  | 2.626252 | 21.95836 | 6.897202 | 323.5052
8          | 1.462372 | 2.618509 | 22.27221 | 6.856591 | 171.4519
16         | 1.460063 | 2.618445 | 22.68553 | 6.856254 | 89.42714
32         | 1.462206 | 2.621673 | 22.95873 | 6.873168 | 51.70563
64         | 1.461789 | 2.621172 | 22.92425 | 6.870543 | 30.12648
128        | 1.46046  | 2.619546 | 22.79555 | 6.862022 | 19.59539
256        | 1.552644 | 2.690982 | 21.83746 | 7.241385 | 17.14775

Epochs     | MAE      | RMSE     | MAPE     | MSE      | Ex_Time
4          | 1.461844 | 2.621236 | 22.92868 | 6.870877 | 9.003222
8          | 1.695705 | 2.700662 | 27.72871 | 7.293575 | 11.34329
16         | 1.462735 | 2.618752 | 22.25132 | 6.857862 | 17.56079
32         | 1.460076 | 2.61849  | 22.69089 | 6.856487 | 29.36996
64         | 1.462337 | 2.621829 | 22.96914 | 6.873986 | 53.02854
128        | 1.462568 | 2.622095 | 22.98662 | 6.87538  | 94.30881
256        | 1.462714 | 2.622268 | 22.99774 | 6.876288 | 81.09458


Figure 4: Batch Size (upper) and Epochs (lower) for Bi-LSTM


Figure 5: Number of influencing days (upper) and number
of neighbors (lower) for Bi-LSTM


Figure 6: Bi-LSTM 3D representation of t, k and MAPE

• Gated Recurrent Unit:

The process of finding the batch size for the GRU model is
the same as for the Bi-LSTM. The batch table (upper part of Figure
7) shows that MAPE increases with the batch size up to a batch
size of 32 and changes little afterwards. The epochs table (lower
part of Figure 7) shows that MAPE drops sharply over the first
epochs (from 38.13 with 4 epochs to 22.26 with 16 epochs) and then
levels off. It means that the neural network is learning.


Batch Size | MAE      | RMSE     | MAPE     | MSE      | Ex_Time
4          | 1.47135  | 2.625978 | 21.96459 | 6.895763 | 270.6237
8          | 1.462273 | 2.618448 | 22.27791 | 6.85627  | 130.3596
16         | 1.460025 | 2.618315 | 22.66918 | 6.855573 | 68.19878
32         | 1.462291 | 2.621777 | 22.96565 | 6.873713 | 37.63035
64         | 1.460971 | 2.62024  | 22.85365 | 6.865658 | 23.03905
128        | 1.460505 | 2.61964  | 22.80368 | 6.862513 | 13.47705
256        | 1.459963 | 2.618018 | 22.6272  | 6.854018 | 10.11925

Epochs     | MAE      | RMSE     | MAPE     | MSE      | Ex_Time
4          | 2.278859 | 3.020093 | 38.12708 | 9.120964 | 5.470087
8          | 1.481895 | 2.632573 | 23.64478 | 6.930443 | 9.071565
16         | 1.462572 | 2.618641 | 22.26074 | 6.857278 | 12.1776
32         | 1.459983 | 2.618093 | 22.63824 | 6.85441  | 22.46944
64         | 1.462371 | 2.621868 | 22.97174 | 6.874191 | 38.05761
128        | 1.462596 | 2.622127 | 22.98874 | 6.875552 | 84.79466
256        | 1.462505 | 2.622022 | 22.9819  | 6.875    | 151.5951


Figure 7: Batch Size and Epoch for GRU


After doing more tests, we pick the optimal parameters as
follows:

— GRU Units per layer = 5 

— Batch size = 16 

— Number of epoch = 32 

— Number of GRU hidden layers = 2 

— Number of Dense layer = 1 

— Dropout = 0.3 




IDEAS’22, August 22-24, 2022, Budapest, Hungary 


[Figure 8, upper panel: GRU number of influencing days t; lower panel: GRU number of neighbors k]

Figure 8: Number of influencing days (upper) and number
of neighbors (lower) for GRU

Figure 9: GRU 3D representation of t, k and MAPE


Figure 8 shows that the MAPE fluctuates when the number
of days increases, but the change is not significant. Figure
8 also shows that the more neighbors we add,
the lower the MAPE. So the number of neighbors influences the
error significantly. Figure 9 is a 3D plot showing the relationship
between the number of days, the number of neighbors, and
the resulting MAPE.
• Temporal Convolution Neural Network:

The way of finding the optimal parameters for the TCN is
different from that for the Bi-LSTM and GRU models. TCN layers
have filters instead of neuron units. Each layer's output size
differs from its input size because TCN units use dilation
rates. So, the size of the input should match the requirements
of the TCN units. The TCN input takes the shape (number of samples, number of
steps, number of features). We get an error if our input data
does not match the required input shape. We also need to manually
adjust the hidden layers to match the output of each TCN
layer.


Batch Size | MAE | RMSE | MAPE | MSE | Ex_Time
4 | 1.575733 | 3.107699 | 23.88338 | 9.657791 | 97.82327
8 | 1.399156 | 2.717456 | 23.36261 | 7.384569 | 52.07021
16 | 1.517412 | 3.288502 | 23.16181 | 10.81425 | 28.07754
32 | 1.552875 | 3.026786 | 24.14521 | 9.161434 | 15.10343
64 | 1.483259 | 2.815336 | 30.24784 | 7.926118 | 8.203065
128 | 1.502759 | 2.612006 | 27.33314 | 6.822576 | 5.412325
256 | 1.594034 | 3.142262 | 28.12091 | 9.873813 | 3.767615

Epochs | MAE | RMSE | MAPE | MSE | Ex_Time
4 | 2.293906 | 3.895824 | 33.98983 | 15.17745 | 1.621808
8 | 1.635687 | 3.339997 | 24.71951 | 11.15558 | 2.35916
16 | 1.622271 | 3.320666 | 24.54198 | 11.02682 | 4.110749
32 | 1.595795 | 3.308152 | 24.54881 | 10.94387 | 3.429568
64 | 1.608586 | 3.312746 | 24.53013 | 10.97429 | 3.404691
128 | 1.602984 | 3.319115 | 24.66448 | 11.01653 | 3.614789
256 | 1.611614 | 3.322392 | 24.6466 | 11.03829 | 3.441077


Figure 10: Batch Size (upper) and Epochs (lower) for TCN 


The batch table (upper part of Figure 10) shows the different batch
sizes that we tested. These data were obtained with k = 1,
t = 1, and kernel size = 1. The epochs table (lower part of Figure 10)
shows the influence of the number of epochs on the different errors.
After multiple tests, we chose parameters that could
be used to train and test on all k and t, as follows:

— TCN filters per layer = 5 

— Batch size = 16 

— Number of epoch = 32 

— Number of TCN hidden layers = 1 

— Number of Dense layer = 1 

— Dropout = 0.0 

Figure 11 shows barely any change as the number of days increases. So
either the number of days has no influence on the TCN or the
increase in days is too small for the network to learn
anything. Figure 11 also shows that the TCN notices the change in
nearest neighbors, but MAPE goes down and up instead
of steadily decreasing. The TCN either overestimates or
underestimates the impact of nearest neighbors on the
calculation of MAPE. Figure 12 shows the relation between t,
k, and the resulting MAPE. We can see some sharp changes
when the number of neighbors changes, but the changes are
not coherent.


Figure 13, Figure 14, and Figure 15 show all the 36 measurements 
for the Bi-LSTM, GRU, and TCN, respectively. By comparing their 
MAPE, we can see that the Bi-LSTM model starts with the lowest 
MAPE followed by the GRU. When we compare the execution time 
of the three, we can see that the TCN leads the other two and then 
is followed by the GRU. By comparing the rate of decrease of the 
MAPE, we see that the GRU is the fastest followed by the Bi-LSTM. 
The GRU is the model that ends with the lowest MAPE. 
Figure 11: Number of influencing days (upper) and
number of neighbors (lower) for TCN

Figure 12: TCN 3D representation of t, k and MAPE


We then apply the k-d tree algorithm to find the nearest
neighbors, from 1 to 6. After that, we build three neural network models
and test the influence of spatiotemporal correlation on the
estimation of MAPE. Once the testing is done, we analyze the outputs to see
the patterns in each result. Our proposed preprocessing technique
was meant to reduce the overfitting of the models and allow them
to find the patterns more easily, which can lead to better estimations.
However, the results show that increasing the dataset size with
dummy data also increases the estimation error and does not
effectively solve the overfitting problem. From all the observations
above, we conclude that the Bi-LSTM is optimal for predicting PM2.5
when spatiotemporal data is not considered. Once the spatiotemporal
parameters are included, we see that the GRU model takes
over and performs better than the Bi-LSTM. GRU is also faster than




MAE | RMSE | MAPE | MSE | Ex_Time(sec) | k | t
1.459993 | 2.618142 | 22.64551 | 6.854667 | 55.80943 | 1 | 1
1.461335 | 2.620101 | 22.76136 | 6.864928 | 56.27454 | 1 | 2
1.460469 | 2.619564 | 22.63648 | 6.862117 | 61.82934 | 1 | 3
1.461691 | 2.621185 | 22.72055 | 6.870611 | 77.02002 | 1 | 4
1.464294 | 2.624533 | 22.93741 | 6.888175 | 93.66734 | 1 | 5
1.463691 | 2.622856 | 22.6768 | 6.879372 | 90.28486 | 1 | 6
1.466322 | 2.429374 | 21.01706 | 5.901857 | 41.17057 | 2 | 1
1.466971 | 2.429801 | 21.04703 | 5.903932 | 56.55449 | 2 | 2
1.465671 | 2.427032 | 21.14339 | 5.890485 | 69.88561 | 2 | 3
1.469165 | 2.434054 | 20.97907 | 5.92462 | 72.63165 | 2 | 4
1.467653 | 2.429117 | 21.17102 | 5.900609 | 83.28967 | 2 | 5
1.4702 | 2.425392 | 21.49791 | 5.882524 | 90.1072 | 2 | 6
1.431287 | 2.186599 | 20.85703 | 4.781213 | 42.18239 | 3 | 1
1.431852 | 2.188875 | 20.81414 | 4.791176 | 56.81237 | 3 | 2
1.431469 | 2.18987 | 20.77876 | 4.795529 | 61.51445 | 3 | 3
1.433718 | 2.186835 | 20.96872 | 4.782249 | 71.86734 | 3 | 4
1.433301 | 2.191644 | 20.80796 | 4.803302 | 77.44578 | 3 | 5
1.434577 | 2.190473 | 20.902 | 4.798172 | 96.86201 | 3 | 6
1.341645 | 1.991223 | 19.53429 | 3.964968 | 43.54749 | 4 | 1
1.342976 | 1.988201 | 19.68454 | 3.952945 | 58.86083 | 4 | 2
1.342324 | 1.992904 | 19.52814 | 3.971665 | 60.32816 | 4 | 3
1.345121 | 1.987113 | 19.1916 | 3.948619 | 69.34163 | 4 | 4
1.345576 | 1.988362 | 19.0364 | 3.953585 | 79.61906 | 4 | 5
1.344734 | 1.995271 | 19.56675 | 3.981105 | 85.46213 | 4 | 6
1.267253 | 1.841429 | 18.73256 | 3.39086 | 42.00405 | 5 | 1
1.270039 | 1.840231 | 18.87385 | 3.386449 | 53.06629 | 5 | 2
1.271258 | 1.840058 | 1.93459 | 3.385814 | 67.6055 | 5 | 3
1.268376 | 1.843872 | 18.72783 | 3.399864 | 73.26147 | 5 | 4
1.269581 | 1.843815 | 18.77968 | 3.399652 | 35.06481 | 5 | 5
1.270304 | 1.844765 | 18.78271 | 3.403158 | 86.75685 | 5 | 6
1.263632 | 1.780812 | 18.82309 | 3.17129 | 41.75899 | 6 | 1
1.263187 | 1.782209 | 18.77605 | 3.176267 | 53.13632 | 6 | 2
1.263059 | 1.782677 | 18.76683 | 3.177938 | 58.75876 | 6 | 3
1.262651 | 1.784428 | 18.71198 | 3.184182 | 68.92715 | 6 | 4
1.262949 | 1.785553 | 18.69949 | 3.1882 | 77.07594 | 6 | 5
1.265009 | 1.785203 | 18.78025 | 3.18695 | 99.19309 | 6 | 6


Figure 13: Measurements for Bi-LSTM 


the Bi-LSTM. Finally, the TCN is the least accurate, but it is the
fastest of the three neural network models. Unfortunately, it is the
model that could not clearly extract the spatial correlation in the
data. This is due to its lack of recurrence. Among the three models,
we choose the GRU because it runs up to about twice as fast
as the Bi-LSTM (compare Figures 13 and 14) and its accuracy improves faster while extracting
spatiotemporal correlation in the data.


6 CONCLUSION 


We built three neural network models and used them to estimate
the concentration of PM2.5. We experimented with two data augmentation
techniques. Our results show that the spatiotemporal
technique proposed by Li et al. [25] yields better results when
more nearest neighbors are used. We were able to verify that the
Bi-LSTM model yields better results when time and data correlation
are not taken into account. We also showed that the GRU model is almost
as accurate as the Bi-LSTM and even outperforms the Bi-LSTM
when the execution time and the nearest neighbors are taken into
account. Finally, we showed that even though the TCN is the least accurate of the
three, it is also the fastest of the three models. We think that the TCN,
used as a sequential model, should not be used to extract multiple
spatial correlation features.
MAE | RMSE | MAPE | MSE | Ex_Time(sec) | k | t
1.460167 | 2.618829 | 22.72816 | 6.858263 | 27.64781 | 1 | 1
1.46101 | 2.618735 | 22.55699 | 6.857776 | 30.61233 | 1 | 2
1.460408 | 2.619187 | 22.55998 | 6.860142 | 36.53509 | 1 | 3
1.461872 | 2.621681 | 22.77296 | 6.873212 | 44.74497 | 1 | 4
1.46263 | 2.621797 | 22.66992 | 6.87382 | 51.75114 | 1 | 5
1.463982 | 2.623673 | 22.78229 | 6.883659 | 52.51981 | 1 | 6
1.46726 | 2.430976 | 20.97812 | 5.909647 | 29.95058 | 2 | 1
1.466262 | 2.423684 | 21.31372 | 5.874245 | 36.54653 | 2 | 2
1.467085 | 2.431126 | 21.00971 | 5.910374 | 44.92777 | 2 | 3
1.466668 | 2.425575 | 21.28085 | 5.883414 | 70.28458 | 2 | 4
1.469701 | 2.434436 | 21.00668 | 5.926477 | 56.6749 | 2 | 5
1.471814 | 2.437049 | 20.98925 | 5.939206 | 60.98529 | 2 | 6
1.433602 | 2.182926 | 21.0699 | 4.765164 | 29.6228 | 3 | 1
1.431654 | 2.190685 | 20.74977 | 4.7991 | 36.05202 | 3 | 2
1.431581 | 2.189134 | 20.80613 | 4.792306 | 40.57423 | 3 | 3
1.43427 | 2.186059 | 21.0154 | 4.778855 | 49.37666 | 3 | 4
1.433586 | 2.190294 | 20.86052 | 4.797389 | 55.44264 | 3 | 5
1.43719 | 2.186604 | 21.13113 | 4.781238 | 59.07152 | 3 | 6
1.342829 | 1.986124 | 19.73606 | 3.944689 | 27.86081 | 4 | 1
1.342828 | 1.988639 | 19.66536 | 3.954684 | 33.7905 | 4 | 2
1.343872 | 1.986762 | 19.77545 | 3.947222 | 39.2521 | 4 | 3
1.343279 | 1.991704 | 19.60463 | 3.966884 | 47.0548 | 4 | 4
1.346331 | 1.98748 | 19.86306 | 3.950076 | 52.91212 | 4 | 5
1.344798 | 1.994077 | 19.60369 | 3.976343 | 58.53346 | 4 | 6
1.26875 | 1.839882 | 18.82926 | 3.385166 | 29.83803 | 5 | 1
1.268665 | 1.841296 | 1.79379 | 3.390371 | 35.59691 | 5 | 2
1.267651 | 1.843166 | 18.71499 | 3.397262 | 41.31305 | 5 | 3
1.267906 | 1.844748 | 18.68749 | 3.403094 | 48.62502 | 5 | 4
1.271715 | 1.842058 | 18.90543 | 3.393179 | 55.60395 | 5 | 5
1.270822 | 1.844176 | 18.81729 | 3.400983 | 61.99344 | 5 | 6
1.262835 | 1.781276 | 18.78265 | 3.172944 | 29.01878 | 6 | 1
1.260738 | 1.785016 | 18.61891 | 3.186283 | 35.91692 | 6 | 2
1.260766 | 1.785434 | 18.61644 | 3.187775 | 43.66646 | 6 | 3
1.26367 | 1.783481 | 18.77214 | 3.180805 | 48.4295 | 6 | 4
1.262084 | 1.786844 | 18.6374 | 3.192812 | 52.62445 | 6 | 5
1.264207 | 1.785989 | 18.73203 | 3.189757 | 56.07748 | 6 | 6


Figure 14: Measurements for GRU 


In the future, we think that using an ensemble of neural networks
to build hybrid neural networks and adding more features such as
temperature, wind, and other pollutant particles can increase the
stability and accuracy of our model. We could also experiment on
the accuracy of estimating one location at a time rather than multiple
locations at once. Using an interpolation technique such as linear
interpolation could have given better results. Finally, we could test
the efficiency of the model in multivariate domains such as traffic
data combined with meteorological data, which could help better
understand the human impact on climate change.
ABSTRACT


Frequent pattern mining is a popular technique in big data min- 
ing and analytics. It discovers frequently occurring sets of items 
(e.g., popular merchandise items, frequently co-occurring events) 
from big data found in numerous database engineered applica- 
tions. These frequent patterns can be discovered horizontally by 
transaction-centric mining algorithms or vertically by item-centric 
mining algorithms. Regardless of their mining direction (horizontal 
or vertical), traditional frequent pattern mining algorithms aim 
to discover Boolean frequent patterns in the sense that patterns 
capture the presence (or absence) of items within the discovered 
patterns. However, there are many real-life situations, in which 
quantities of items within the patterns are important. For example, 
the quantity of items may also affect profits of selling the items 
within the discovered patterns. Hence, in this paper, we present an 
algorithm for vertical mining of interesting quantitative frequent 
patterns. This Q-Eclat algorithm first represents the big data as 
a collection of equivalence classes according to their prefix item 
labels. Each domain item is represented by one of these classes. 
Their corresponding item-centric sets capture (a) IDs of transac- 
tions containing the item, as well as (b) the quantity of that item 
in each transaction. With this representation, our algorithm then 
vertically mines quantitative frequent patterns. When compared with
the existing MQA-M algorithm (which was built for quantitative
horizontal frequent pattern mining), evaluation results show that
our quantitative vertical Q-Eclat algorithm takes less runtime
to mine quantitative frequent patterns.
1 INTRODUCTION


Nowadays, big data [1, 2] can be found in numerous database engineered
applications. With advances in technology, high volumes of
a wide variety of data (which may be of different levels of veracity)
are generated and collected at a high velocity for numerous real-life
applications and services such as:


• medical/healthcare informatics [3-7] and disease analytics
[8-12];

• transportation analytics [13-15];

• business analytics [16-20]; as well as

• social media mining [21-23] and social network analysis
[24-28].
Embedded in these big data is implicit, previously unknown and 
potentially useful information and knowledge. This calls for big data 
management [29-31], big data mining [32-35], big data analytics 
[36, 37], as well as big data visualization and analytics [38-41]. 
Association rule mining is a popular technique for big data mining 
and analytics. It discovers rules that reveal interesting associations
among the antecedents and consequents of the rules. Generally,
these rules are mined by first discovering frequent patterns and 
then using these discovered frequent patterns to form the rules. 
Frequent pattern mining [42-45] aims to discover frequently occur- 
ring sets of items (aka itemsets)—such as popular merchandise 
items in shopping carts or shopper market baskets, or frequently 
co-located conferences/events—from big data. Given a series of 
transactions containing a set of items, frequent pattern mining 
seeks to determine the sets of items, which occur in a large num- 
ber of transactions. In addition, we wish to discover interesting 
association rules. Association rules state that whenever a certain 
set of items occurs in a transaction, another set of items tends to 
occur in that transaction. The problems of frequent pattern min- 
ing and association rule mining form the basis of many real-life 
applications such as marketing in business, discovering biological 
patterns, studying human populations, web log mining, and many 
database engineered applications. Frequent pattern mining has been 
extended to the mining of other patterns such as network and graph 
mining [46-48], stream mining [49, 50], uncertain pattern mining 
[51-55], and utility pattern mining [56]. 
Frequent patterns can be discovered horizontally by transaction-centric
mining algorithms or vertically by item-centric mining
algorithms [57-62]. The Apriori algorithm [63, 64] is an example of
horizontal transaction-centric frequent pattern mining algorithms,
with which data are represented as a collection of transactions.
Each transaction captures the presence or absence of items. Alterna- 
tively, frequent patterns can also be discovered vertically. The Eclat 
(Equivalence CLAss Transformation) algorithm [61] is an example 
of vertical item-centric frequent pattern mining algorithms, with 
which data are represented as a collection of equivalence classes 
according to their prefix item labels. Each domain item is repre- 
sented by one of these classes, and the corresponding transaction 
ID set for an item captures which transactions contain the specific 
item. Specifically, the set contains transaction IDs. An advantage of 
such a set representation is that the size of set is proportional to the 
density of the data. Sparse data would lead to a small transaction ID 
set. The algorithm was shown to be efficient as it takes advantage 
of set operations in the mining process. 

Whether they mine frequent patterns horizontally or vertically,
traditional algorithms aim to discover Boolean frequent patterns in
the sense that patterns capture the presence (or absence) of items 
within the discovered patterns. While traditional frequent pattern 
mining and association rule mining are useful in many contexts, 
they have a major limitation. This limitation is that in traditional 
frequent pattern mining, we assume that every transaction either 
contains an item or does not contain the item. In other words, an 
item is contained in a transaction 0 or 1 times. For this reason, we 
can also refer to traditional frequent pattern mining as Boolean 
frequent pattern mining. However, in many real-world scenarios, a 
transaction can contain an item more than one time. For example, a 
person at a grocery store may buy multiple apples. To address this 
shortcoming, the notion of quantitative association rule mining 
or quantitative frequent pattern mining [65, 66] was introduced. 
Quantitative frequent pattern mining is essentially an extension of 
frequent pattern mining to allow transactions to contain an item 
more than once. Rather than just trying to find items (which com- 
monly occur in transactions), there is a demand for discovering 
commonly occurring quantities of items. For example, in Boolean 
frequent pattern mining, one may discover that bananas are a fre- 
quently purchased item. In quantitative frequent pattern mining, 
one may discover that customers frequently purchase at least five 
bananas at a time. As another example, the quantity of items may 
also affect profits of selling the items within the discovered patterns. 

Through the discovery of quantitative frequent patterns and 
quantitative association rules, we can obtain more interesting re- 
sults than we would if Boolean association rule mining were used 
instead. In addition to receiving information about which items 
commonly occur together in transactions, we also obtain informa- 
tion regarding how many of each of those items tend to occur in 
transactions. The MQA-M algorithm [65] extends the Apriori algorithm
for mining quantitative association rules with multiple comparison 
operators—i.e., mining quantitative frequent patterns (aka sets of 
item expressions, i.e., itemexpsets for short)—horizontally. 

In this paper, we present a vertical equivalence class-based
algorithm to mine quantitative frequent patterns (i.e.,
itemexpsets) vertically. The resulting Q-Eclat algorithm first represents
the big data as a collection of bitmaps. Each item-centric
transaction ID set captures the IDs of transactions containing the item,
as well as the quantity of that item in each transaction. With this
representation, our algorithm then vertically mines quantitative
frequent patterns. When compared with the existing MQA-M algorithm
(which was built for quantitative frequent pattern mining),
evaluation results show that our quantitative vertical Q-Eclat algorithm
requires shorter execution time to mine frequent patterns. Our key
contributions in this paper include our Q-Eclat algorithm and its
pruning rules.

We organize the remainder of this paper as follows. We begin by 
presenting the mathematical framework for quantitative frequent 
pattern mining in Section 2. We discuss previously used algorithms 
of interest, such as the Apriori, Eclat, and MQA-M algorithms. Then, 
we formally introduce our Q-Eclat algorithm in Section 3. Pseudocode
and an example are provided for the algorithm. Section 4
contains analysis of the algorithm and evaluation to compare our 
Q-Eclat with related works. Finally, we conclude in Section 5. 


2 BACKGROUND AND RELATED WORKS 


Here, we formally define quantitative frequent patterns and review 
relevant algorithms before describing our algorithm for quantitative 
frequent pattern mining. 


2.1 Horizontal Boolean Frequent Pattern 
Mining with the Apriori Algorithm 

As a common algorithm used in Boolean frequent pattern mining, 
the Apriori algorithm [64] provides the foundation for the MQA- 
M algorithm used for quantitative frequent pattern mining. The 
Apriori algorithm works by finding frequent patterns containing 
one item (i.e., 1-itemsets) first, and then finding patterns of higher 
cardinality (i.e., $k$-itemsets for $k \geq 2$) as the algorithm runs.

To elaborate, for any positive integer $k$, let $C_k$ be the set of
candidate patterns containing $k$ items (i.e., candidate $k$-itemsets) and let
$L_k$ be the set of frequent patterns containing $k$ items (i.e., frequent
$k$-itemsets). Note that $L_k \subseteq C_k$, since all frequent patterns are
candidate patterns but the reverse is not necessarily true. The first step in
the Apriori algorithm is to determine $L_1$ (i.e., frequent 1-itemsets).
This is accomplished by scanning through each transaction and
counting the number of occurrences of each item in the transaction
database. The frequent singletons are then discovered to be the
domain items having at least minsup occurrences.

For the remainder of the Apriori algorithm, it repeatedly uses
frequent $(k-1)$-itemsets to generate candidate $k$-itemsets, where $k$
is an integer with $k \geq 2$ (i.e., $2 \leq k \in \mathbb{Z}$), and then discovers which
of those candidate patterns are frequent. In the main loop of the
Apriori algorithm, we start by setting $k = 2$. It initially computes $C_k$
from $L_{k-1}$ by performing a self-join on $L_{k-1}$. If the first $k-2$ items in
two frequent $(k-1)$-itemsets in $L_{k-1}$ are the same, then it generates
a candidate $k$-itemset in $C_k$ containing those $k-2$ items and the
last item of each of those two itemsets. After initially creating $C_k$, it prunes
$C_k$ by removing any candidate $k$-itemsets having at least
one $(k-1)$-subset not belonging to $L_{k-1}$.


EXAMPLE 1. If $L_2 = \{\{a,b\}, \{a,c\}\}$, then Apriori generates $\{a,b,c\} \in C_3$
by joining $\{a,b\}$ and $\{a,c\}$. However, it prunes $\{a,b,c\}$ from
$C_3$ because a subset $\{b,c\} \notin L_2$.
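
The join and prune steps of Example 1 can be sketched in a few lines. This is a minimal illustration, assuming that itemsets are stored as sorted tuples and that items are mutually comparable; the function name `generate_candidates` is hypothetical and does not come from the cited Apriori papers.

```python
from itertools import combinations


def generate_candidates(frequent_prev):
    """Self-join frequent (k-1)-itemsets, then prune by the Apriori property.

    frequent_prev: set of sorted tuples, each a frequent (k-1)-itemset.
    Returns the set of candidate k-itemsets as sorted tuples.
    """
    prev = sorted(frequent_prev)
    candidates = set()
    for i, x in enumerate(prev):
        for y in prev[i + 1:]:
            # Join only itemsets that share their first k-2 items.
            if x[:-1] != y[:-1]:
                continue
            candidate = x + (y[-1],)
            # Prune: every (k-1)-subset of the candidate must itself be frequent.
            subsets = combinations(candidate, len(candidate) - 1)
            if all(sub in frequent_prev for sub in subsets):
                candidates.add(candidate)
    return candidates


# Example 1: {a, b, c} is generated by the join but pruned because {b, c} is missing.
pruned_case = generate_candidates({("a", "b"), ("a", "c")})
kept_case = generate_candidates({("a", "b"), ("a", "c"), ("b", "c")})
```

Here `pruned_case` is empty, whereas `kept_case` contains the single candidate `("a", "b", "c")`.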


Next, Apriori counts the support (i.e., number of occurrences)
of each candidate $k$-itemset in $C_k$ by scanning through each
transaction and determining which candidate $k$-itemsets occur in that
transaction. It is of interest to note that there are ways to speed up
the counting process for the Apriori algorithm [64]. Afterwards,
let $L_k$ be the set of itemsets in $C_k$ with a support of at least minsup.
It then increments $k$ and repeats the previous steps. The
loop terminates when no further patterns can be discovered for
$L_{k-1}$. Consequently, it returns $\bigcup_k L_k$ (i.e., the union of all $L_k$) as all
the frequent patterns.
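
The support-counting step described above can be sketched as a single scan over the horizontal database. This is a simplified illustration without any of the speed-ups mentioned in [64]; the function name `filter_frequent` is hypothetical.

```python
def filter_frequent(transactions, candidates, minsup):
    """Scan the database once and keep the candidates with support >= minsup.

    transactions: iterable of item collections (horizontal representation).
    candidates: iterable of tuples, each a candidate k-itemset.
    Returns {candidate: support} for the frequent candidates only.
    """
    candidates = list(candidates)
    support = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for c in candidates:
            # A candidate occurs in a transaction if all of its items are present.
            if items.issuperset(c):
                support[c] += 1
    return {c: s for c, s in support.items() if s >= minsup}


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
frequent_2 = filter_frequent(db, [("a", "b"), ("a", "c"), ("b", "c")], minsup=2)
```

In this toy database, `("b", "c")` occurs in only one transaction and is therefore dropped.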


2.2 Vertical Boolean Frequent Pattern Mining 
with the Eclat Algorithm 


Recall from Section 1 that the Eclat (Equivalence CLAss Transforma- 
tion) algorithm [61] is an example of vertical item-centric frequent 
pattern mining algorithms, with which data are represented as a 
collection of transaction ID sets (i.e., tidsets). Each tidset for an item 
captures which transactions contain the specific item. The presence 
of the transaction ID in the tidset(X) indicates the corresponding 
transaction contains the item X, whereas the absence of the transac- 
tion ID from the tidset(X) indicates the corresponding transaction 
does not contain the item X. 

Let us discuss the difference between the horizontal transaction 
database and the vertical transaction database. Horizontal trans- 
action databases refer to the usual representation of transactions, 
where a set of items is associated with each transaction [63, 64]. 
The Apriori algorithm uses the horizontal representation. On the 
other hand, one can represent the transaction database in a “verti- 
cal" format [61]. A tidset for an item can represent a transaction 
database in a vertical format by adding a transaction ID to the tid- 
set for indicating the presence of the item in the corresponding 
transaction. 


EXAMPLE 2. For transactions $t_1 = \{a,b\}$ and $t_2 = \{b\}$ in a
horizontal transaction database, the corresponding vertical
representation of the transaction database is $\{t_1\} \subseteq \mathit{tidset}(\{a\})$ and
$\{t_1, t_2\} \subseteq \mathit{tidset}(\{b\})$.
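
The conversion in Example 2 can be sketched as follows, assuming the horizontal database is given as a dictionary from transaction IDs to item sets. The function name `to_vertical` is hypothetical.

```python
def to_vertical(horizontal):
    """Map each item to the set of IDs of the transactions that contain it."""
    tidsets = {}
    for tid, items in horizontal.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


# Example 2: t1 = {a, b} and t2 = {b}.
vertical = to_vertical({"t1": {"a", "b"}, "t2": {"b"}})
```

The result maps `a` to `{"t1"}` and `b` to `{"t1", "t2"}`, matching the tidsets in the example.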


The Eclat algorithm makes use of the vertical representation
to mine frequent patterns. As in the Apriori algorithm, let $C_k$ and
$L_k$ be the sets containing candidate and frequent $k$-itemsets,
respectively. First, the Eclat algorithm discovers which itemsets are
in $L_1$. It computes the support of any candidate 1-itemset by
counting the number of transaction IDs in its corresponding tidset.
Mathematically, for a singleton $\{x\}$, $\mathit{sup}(\{x\}) = |\mathit{tidset}(\{x\})|$. After
computing the support for every item occurring in the transaction
database, $L_1$ contains singletons with a support $\geq$ minsup.

After discovering $L_1$, the main loop of the Eclat algorithm is
executed in a similar fashion as in the Apriori algorithm. Consider
the first loop iteration with $k = 2$. The first part of the loop
involves generating $C_k$ from $L_{k-1}$ by using the same candidate
generation method as in the Apriori algorithm (i.e., performing a
self-join on $L_{k-1}$ and pruning the resulting set). Next, it forms a
tidset corresponding to each candidate $k$-itemset in $C_k$. Suppose
that, for some candidate $k$-itemset $X \in C_k$, $W$ is a $(k-2)$-itemset
containing the first $k-2$ items in $X$, $y$ is the second last item in
$X$, and $z$ is the last item in $X$. Then, $X = W \cup \{y\} \cup \{z\}$. The
algorithm computes the tidset of $X$ as the set intersection of the tidsets of $(W \cup \{y\})$
and $(W \cup \{z\})$, i.e., $\mathit{tidset}(X) = \mathit{tidset}(W \cup \{y\}) \cap \mathit{tidset}(W \cup \{z\})$.
Next, it computes the support of each pattern in $C_k$ by counting
the number of transaction IDs in the resulting set intersection, i.e.,
$\mathit{sup}(X) = |\mathit{tidset}(X)|$. The frequent $k$-itemsets in $L_k$ are computed
as the candidate $k$-itemsets in $C_k$ having a support $\geq$ minsup. At
the end of a loop iteration, it increases $k$ by 1 and continues iterating
through the main loop (if necessary). The loop stops iterating when
$L_{k-1}$ is empty. In a similar fashion to the Apriori algorithm, $\bigcup_k L_k$
is returned as frequent patterns.
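
The loop described above can be sketched as a level-wise procedure over tidsets. This is a simplified illustration rather than the implementation from [61]: it assumes itemsets are sorted tuples, and it omits the subset-pruning step, which affects speed but not the result because supports are computed exactly from the intersections. The function name `eclat` is used here only for illustration.

```python
def eclat(tidsets, minsup):
    """Level-wise vertical mining of Boolean frequent itemsets.

    tidsets: {item: set of transaction IDs containing the item}.
    Returns {itemset as sorted tuple: support}.
    """
    # L_1: singletons whose tidset holds at least minsup transaction IDs.
    level = {(item,): tids for item, tids in tidsets.items() if len(tids) >= minsup}
    frequent = {}
    while level:
        frequent.update({pattern: len(tids) for pattern, tids in level.items()})
        prev = sorted(level)
        next_level = {}
        for i, x in enumerate(prev):
            for y in prev[i + 1:]:
                if x[:-1] != y[:-1]:
                    continue  # the first k-2 items (W) must agree
                # tidset(W + {y} + {z}) is the intersection of the two parent tidsets.
                tids = level[x] & level[y]
                if len(tids) >= minsup:
                    next_level[x + (y[-1],)] = tids
        level = next_level
    return frequent


patterns = eclat({"a": {1, 2, 3}, "b": {1, 2, 4}, "c": {1, 3}}, minsup=2)
```

With `minsup=2`, the sketch reports the three singletons together with `("a", "b")` and `("a", "c")`; the pair `("b", "c")` is supported only by transaction 1 and is dropped.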


2.3 Quantitative Association Rule Mining 


For quantitative association rule mining [65, 66], suppose that
$I = \{i_1, i_2, \ldots, i_m\}$ is the set of all items that can be found in a
transaction database for some positive integer $m \in \mathbb{Z}^+$. Then, a
transaction can be represented as $t = \{(e_1, f_1), (e_2, f_2), \ldots, (e_s, f_s)\}$
for some $s \in \mathbb{Z}^+$ where

• each item $e_i \in I$ such that $e_i \neq e_j$ whenever $i \neq j$, and

• each quantity $f_i \in \mathbb{Z}^+$.
The quantitative transaction database is $D = (t_1, t_2, \ldots, t_n)$, which
is the set of all transactions. Each transaction has a unique
transaction ID. An item-expression, or itemexp for short, is an ordered
triplet of the form $(p\,\theta\,q)$, where $p \in I$, $\theta \in \{=, \geq, \leq\}$, and $q \in \mathbb{Z}^+$.
Then, a set of item expressions, or itemexpset for short, can be
defined as a set $X = \{x_1, x_2, \ldots, x_k\}$ for some $k \in \mathbb{Z}^+$ where each
$x_i = (p_i\,\theta_i\,q_i)$ is an itemexp such that $p_i \neq p_j$ whenever $i \neq j$. Then,
$t$ satisfies $X$ if $\forall i \in \{1, 2, \ldots, k\}$, $\exists j \in \{1, 2, \ldots, s\}$ such that $p_i = e_j$
and the expression $(f_j\,\theta_i\,q_i)$ is true. If an itemexpset $X$ contains an
itemexp of the form $(p \leq q)$ where $p \in I$ and $q \in \mathbb{Z}^+$, then for a
transaction $t$ to satisfy $X$, item $p$ must still occur in $t$ at least once,
even though $0 \leq q$. In other words, the number of occurrences of
item $p \in t$ must be in the interval $[1, q]$. By including this restriction,
many itemexpsets are prevented from being considered where an
item can occur zero times.


EXAMPLE 3. For a transaction $t_1 = \{(a,2), (b,3), (c,1)\}$ (which
captures 2 occurrences of item $a$, 3 occurrences of item $b$, and 1
occurrence of item $c$), it satisfies the itemexpset $X_1 = \{(a = 2), (b \geq 1)\}$.
However, $t_1$ does not satisfy $X_2 = \{(a \leq 2), (c \geq 2)\}$ because $X_2$
requires the quantity of $c$ to be at least 2 but $c$ only occurs once in $t_1$ (i.e.,
quantity of $c$ = 1). Similarly, $t_2 = \{(a,1)\}$ also does not satisfy
$X_3 = \{(b \leq 2)\}$ because $X_3$ requires $0 <$ quantity of $b$ $\leq 2$ but $b$ does
not occur in $t_2$ (i.e., quantity of $b$ = 0).
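
The satisfaction test of Example 3 can be sketched as follows. This is a minimal illustration, assuming a transaction is a dictionary from items to positive quantities and an itemexp is a triple `(item, op, q)` with `op` written as `"="`, `">="`, or `"<="`. The names `OPS` and `satisfies` are hypothetical.

```python
import operator

# Comparison operators allowed in an item expression (assumed string encoding).
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def satisfies(transaction, itemexpset):
    """Return True if the transaction satisfies every item expression.

    transaction: {item: quantity}, with quantity >= 1 for every stored item.
    itemexpset: iterable of (item, op, q) triples.
    """
    for item, op, q in itemexpset:
        quantity = transaction.get(item, 0)
        # The item must occur at least once, even for "<=" expressions.
        if quantity < 1 or not OPS[op](quantity, q):
            return False
    return True


t1 = {"a": 2, "b": 3, "c": 1}
t2 = {"a": 1}
x1_ok = satisfies(t1, [("a", "=", 2), ("b", ">=", 1)])
x2_ok = satisfies(t1, [("a", "<=", 2), ("c", ">=", 2)])
x3_ok = satisfies(t2, [("b", "<=", 2)])
```

As in the example, only the first check succeeds: the second fails because `c` occurs once, and the third fails because `b` does not occur at all.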


The support $\mathit{sup}(X)$ of an itemexpset $X$ is defined to be the
number of transactions (in $D$) satisfying $X$. Now, let minsup be some
non-negative real number, i.e., $\mathit{minsup} \in \mathbb{R}^+ \cup \{0\}$. Then, $X$ is a
frequent itemexpset if $\mathit{sup}(X) \geq \mathit{minsup}$.

As association rules can be defined for Boolean frequent pattern
mining, they can also be defined for quantitative frequent pattern
mining. For two itemexpsets $X$ and $Y$, the association rule $X \rightarrow Y$
is interesting if:

• there are no common items between $X$ and $Y$,
• $\mathit{sup}(X \rightarrow Y) = \mathit{sup}(X \cup Y) \geq \mathit{minsup}$, and
• $\mathit{conf}(X \rightarrow Y) = \frac{\mathit{sup}(X \cup Y)}{\mathit{sup}(X)} \geq \mathit{minconf} \in [0,1]$.


EXAMPLE 4. Association rule $\{(a = 2)\} \rightarrow \{(b \leq 3), (c = 1)\}$ can
be interesting if its support and confidence values can be satisfied.
However, a rule $\{(a = 5)\} \rightarrow \{(a \geq 3)\}$, which reveals that a customer
who purchases exactly 5 orders of item $a$ is likely to purchase at least
3 orders of $a$, cannot be interesting because item $a$ is common on both
sides of the rule.
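
The three conditions for an interesting rule can be checked as sketched below. This sketch assumes the standard definition of confidence, i.e., the support of the union of both sides divided by the support of the antecedent, and the same hypothetical encoding of item expressions as `(item, op, q)` triples. The names `support` and `is_interesting` are illustrative only.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def support(database, itemexpset):
    """Count the transactions ({item: quantity}) satisfying every (item, op, q)."""
    return sum(
        all(item in t and OPS[op](t[item], q) for item, op, q in itemexpset)
        for t in database
    )


def is_interesting(database, x, y, minsup, minconf):
    """Check the three conditions for the rule X -> Y."""
    if {item for item, _, _ in x} & {item for item, _, _ in y}:
        return False  # X and Y share an item
    sup_xy = support(database, list(x) + list(y))
    sup_x = support(database, x)
    # Assumed (standard) confidence: sup(X u Y) / sup(X).
    return sup_xy >= minsup and sup_x > 0 and sup_xy / sup_x >= minconf


db = [{"a": 2, "b": 3, "c": 1}, {"a": 2, "b": 1}, {"a": 5}]
rule_ok = is_interesting(db, [("a", "=", 2)], [("b", "<=", 3)], minsup=2, minconf=0.8)
rule_bad = is_interesting(db, [("a", "=", 5)], [("a", ">=", 3)], minsup=1, minconf=0.1)
```

The second rule is rejected regardless of the thresholds because item `a` appears on both sides, as in Example 4.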
2.4 Horizontal Quantitative Frequent Pattern
Mining with the MQA-M Algorithm

As an algorithm for mining quantitative frequent patterns, the MQA-M
(Mining Quantitative Association rules with Multiple comparison
operators) algorithm [65] is similar to the Apriori algorithm except that it is
generalized to handle quantitative transaction databases. For any
$k \in \mathbb{Z}^+$, let $C_k$ be the set of candidate itemexpsets containing $k$
itemexps, and let $L_k$ be the set of frequent itemexpsets containing
$k$ itemexps. As in the Apriori algorithm, $L_k \subseteq C_k$. The MQA-M
algorithm starts by generating $C_1$. Suppose that itemmax[$p$]
represents the maximum number of times an item $p$ appears in a
transaction.


EXAMPLE 5. If a quantitative database consists of two transactions
$t_1 = \{(a,1)\}$ and $t_2 = \{(a,3)\}$, then itemmax[$a$] = 3 because the
highest number of times $a$ appears in a transaction is 3.


Afterwards, for each item $p$ appearing in the quantitative
transaction database, add every itemexpset of the form $\{(p\,\theta\,q)\}$ to $C_1$,
where $\theta \in \{=, \geq, \leq\}$ and $q \in \{1, \ldots, \mathit{itemmax}[p]\}$. The algorithm
computes the support of each itemexpset in $C_1$ by iterating through
the transactions and incrementing the support of an itemexpset in
$C_1$ if the transaction satisfies the itemexpset. Let $k = 1$. Then, $L_k$
becomes the set of all itemexpsets with a support $\geq$ minsup. The
algorithm removes some itemexpsets from $L_k$ by using two pruning
rules [65]:


(1) Suppose that $X$ contains an itemexp of the form $(z \leq r)$,
where $z$ is an item and $r \in \mathbb{Z}^+$. This first pruning rule
states that, if there is another itemexpset $Y \in L_k$ that has the
same support as $X$ and is identical to $X$ except that $(z \leq r)$ is
replaced by $(z \leq r + 1)$, then $Y$ can be pruned from $L_k$.

(2) Suppose that $X$ contains an itemexp of the form $(z \geq r)$,
where $z$ is an item and $r \in \mathbb{Z}^+$. This second pruning rule
states that, if there is another itemexpset $Y \in L_k$ that has the
same support as $X$ and is identical to $X$ except that $(z \geq r)$ is
replaced by $(z \geq r - 1)$, then $Y$ can be pruned from $L_k$.
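
The generation of $C_1$ and the horizontal support counting that precede these pruning rules can be sketched as follows. This is an illustrative sketch rather than the implementation of [65]: the tuple encoding `(item, op, q)` of an itemexp, the operator strings, and all function names are assumptions. The sketch also assumes that a transaction lacking an item satisfies no itemexp on that item, which agrees with the tidsets listed in Table 1.

```python
import operator

# Comparison operators allowed in an itemexp (item, op, quantity).
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def candidate_1_itemexpsets(transactions):
    """Build C_1: every (item, op, q) with 1 <= q <= itemmax[item]."""
    itemmax = {}
    for transaction in transactions:
        for item, quantity in transaction.items():
            itemmax[item] = max(itemmax.get(item, 0), quantity)
    return [
        ((item, op, q),)
        for item in sorted(itemmax)
        for op in OPS
        for q in range(1, itemmax[item] + 1)
    ]


def satisfies(transaction, itemexpset):
    """True if the transaction meets every itemexp in the itemexpset.

    Assumption: a transaction that does not contain the item satisfies
    no itemexp on that item.
    """
    return all(
        item in transaction and OPS[op](transaction[item], q)
        for item, op, q in itemexpset
    )


def horizontal_support(transactions, candidates):
    """Scan the transactions once, counting each candidate they satisfy."""
    support = {candidate: 0 for candidate in candidates}
    for transaction in transactions:
        for candidate in candidates:
            if satisfies(transaction, candidate):
                support[candidate] += 1
    return support


# The three transactions of Example 9, with minsup = 1.
database = [{"a": 2}, {"a": 4, "b": 1}, {"b": 1}]
c1 = candidate_1_itemexpsets(database)
support = horizontal_support(database, c1)
l1 = {x: s for x, s in support.items() if s >= 1}
print(len(c1), len(l1))  # 15 12
```

Every candidate is tested against every transaction, so the cost of this horizontal pass grows with the product of the database size and the number of candidates.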


Like the Apriori algorithm, MQA-M also has a main loop, which first
runs with $k = 2$. The loop body begins by generating $C_k$ from
$L_{k-1}$. $C_k$ is initially generated using a self-join on $L_{k-1}$:
if two itemexpsets in $L_{k-1}$ have the same first $(k - 2)$ itemexps,
then the join generates an itemexpset in $C_k$ consisting of those
$(k - 2)$ itemexps and the last itemexp of each of the two itemexpsets in
$L_{k-1}$. However, it imposes the additional restriction that it does not
create an itemexpset in $C_k$ in which two itemexps refer to the same
item. After the join step, it prunes from $C_k$ every itemexpset that has
a subset of $(k - 1)$ itemexps that is not in $L_{k-1}$. It obtains $L_k$
from $C_k$ using the same procedure that was used to obtain $L_1$. It then
uses the two aforementioned pruning rules to remove some itemexpsets from
$L_k$. At the end of the loop body, it increments $k$ and repeats the
previous steps (if necessary). The loop terminates when $L_{k-1}$ is
empty. Afterwards, it returns $\bigcup_k L_k$, which contains all the
interesting frequent itemexpsets.


EXAMPLE 6. Although $L_1$ contains $\{(a = 1)\}$ and $\{(a \geq 2)\}$,
MQA-M does not form $\{(a = 1), (a \geq 2)\} \in C_2$.
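
The join and subset-pruning steps of the main loop can be sketched as below. The sketch assumes that each itemexpset is stored as a sorted tuple of `(item, op, q)` triples; that encoding and the name `self_join` are illustrative choices, not part of [65].

```python
from itertools import combinations


def self_join(frequent_prev):
    """Generate C_k from L_{k-1} (itemexpsets are sorted tuples of itemexps)."""
    frequent = set(frequent_prev)
    candidates = []
    for x, y in combinations(sorted(frequent), 2):
        if x[:-1] != y[:-1]:
            continue  # the first k-2 itemexps must be identical
        if x[-1][0] == y[-1][0]:
            continue  # restriction: never two itemexps on the same item
        candidate = x + (y[-1],)
        # Subset pruning: every (k-1)-subset must itself be frequent.
        subsets = (candidate[:i] + candidate[i + 1:] for i in range(len(candidate)))
        if all(subset in frequent for subset in subsets):
            candidates.append(candidate)
    return candidates


l1 = [(("a", "=", 1),), (("a", ">=", 2),), (("b", "=", 1),)]
for candidate in self_join(l1):
    print(candidate)
# (('a', '=', 1), ('b', '=', 1))
# (('a', '>=', 2), ('b', '=', 1))
```

As in Example 6, the pair of itemexps on item $a$ is never joined, because both refer to the same item.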
3 VERTICAL QUANTITATIVE FREQUENT
PATTERN MINING WITH OUR Q-ECLAT
ALGORITHM


3.1 Vertical Representation of Quantitative 
Data 


To represent quantitative transaction databases in a vertical format, 
for each item that occurs in the transaction database, we store it as a 
set of pairs. Each pair contains a transaction ID associated with that 
item and the number of occurrences of the item in the transaction. 
Since each element is a pair, we call these sets “pairsets”. It
is useful to convert the quantitative transaction database to this 
vertical format when implementing the Q-Eclat algorithm. 


EXAMPLE 7. A horizontal database containing two transactions
$t_1 = \{(a, 1)\}$ and $t_2 = \{(a, 3)\}$ can be represented vertically
using $pairset(a) = \{(t_1, 1), (t_2, 3)\}$.
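
A minimal sketch of the horizontal-to-vertical conversion is given below. The dictionary-of-dictionaries input format and the name `to_pairsets` are assumptions made for this illustration.

```python
def to_pairsets(transactions):
    """Map each item to its pairset: the set of (tid, quantity) pairs."""
    pairsets = {}
    for tid, transaction in transactions.items():
        for item, quantity in transaction.items():
            pairsets.setdefault(item, set()).add((tid, quantity))
    return pairsets


# The horizontal database of Example 7, keyed by transaction ID.
horizontal = {"t1": {"a": 1}, "t2": {"a": 3}}
vertical = to_pairsets(horizontal)
print(sorted(vertical["a"]))  # [('t1', 1), ('t2', 3)]
```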


For quantitative association rule mining, we define $tidset(X)$ of
any itemexpset $X$ to be the set of transaction IDs of the transactions
that satisfy $X$. When $X$ is an itemexpset containing at least two
itemexps, we can decompose it as $X = W \cup \{y\} \cup \{z\}$, where
(a) $W$ is an itemexpset with two fewer elements than $X$ and (b) $y$ and
$z$ are itemexps. As with tidsets for Boolean frequent itemset mining, we
have the recursive equation
$tidset(X) = tidset(W \cup \{y\}) \cap tidset(W \cup \{z\})$. We use this
equation to generate tidsets for itemexpsets containing at least two
elements when running our Q-Eclat algorithm. The support of an itemexpset
$X$ can be computed by counting the number of elements in its tidset, i.e.,
$sup(X) = |tidset(X)|$.
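
The following sketch illustrates these definitions on the three transactions that Example 9 uses later ($t_1 = \{a : 2\}$, $t_2 = \{a : 4, b : 1\}$, $t_3 = \{b : 1\}$). The tuple encoding of an itemexp and the helper name `tidset_of_itemexp` are assumptions for illustration.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def tidset_of_itemexp(pairsets, itemexp):
    """Read the tidset of a single itemexp directly off the item's pairset."""
    item, op, q = itemexp
    return {tid for tid, quantity in pairsets.get(item, ()) if OPS[op](quantity, q)}


# Vertical representation of the three transactions of Example 9.
pairsets = {
    "a": {("t1", 2), ("t2", 4)},
    "b": {("t2", 1), ("t3", 1)},
}

tids_a = tidset_of_itemexp(pairsets, ("a", ">=", 2))  # tidset of {(a >= 2)}
tids_b = tidset_of_itemexp(pairsets, ("b", "=", 1))   # tidset of {(b = 1)}

# Recursive equation with W empty: intersect the two tidsets.
tids_ab = tids_a & tids_b
print(sorted(tids_ab), len(tids_ab))  # ['t2'] 1
```

No transaction is rescanned to obtain the support of the 2-itemexpset; a single set intersection is enough.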


3.2 Q-Eclat Algorithm 


Here, let us describe how our Q-Eclat algorithm discovers quantitative
frequent patterns vertically. For any integer $k \geq 1$, define $C_k$ to
be the set of candidate $k$-itemexpsets and $L_k$ to be the set of
frequent $k$-itemexpsets. First, we convert the quantitative transaction
database into a vertical format if it is in its horizontal format.
The vertical format is useful for computing the tidsets corresponding
to the candidate 1-itemexpsets (i.e., $C_1$). The next step of our
algorithm is to compute all candidate 1-itemexpsets in $C_1$. Each of
those itemexpsets consists of a single itemexp of the form


(item, operator, quantity)

where

• item is an item in the transaction database,
• operator $\theta \in \{=, \geq, \leq\}$, and
• quantity $q \in \{1, \ldots, itemmax[item]\}$.


We compute $itemmax[item]$ as the maximum number of times the
item appears in a transaction, over all transactions in the transaction
database.

After computing $C_1$, we compute the tidset associated with
each candidate 1-itemexpset. The tidsets can be computed from
the vertical representation of the quantitative transaction database.
We then compute the support of each candidate 1-itemexpset by
counting the elements in its corresponding tidset. The frequent
1-itemexpsets are the candidate 1-itemexpsets having a support
$\geq minsup$.


Finally, we remove some itemexpsets from $L_1$ based on our two
new pruning rules, which will be described in Section 3.3.

Then, we set $k = 2$ and begin executing the main loop. The
first step in the main loop body is to generate $C_k$ using $L_{k-1}$. We
initially create $C_k$ by performing a self-join on $L_{k-1}$. If there are
two frequent $(k - 1)$-itemexpsets in $L_{k-1}$ whose first $(k - 2)$
itemexps are the same and whose last itemexps refer to different items,
then we add to $C_k$ a candidate $k$-itemexpset that consists of the
first $(k - 2)$ itemexps and the last itemexp of both itemexpsets.
Afterwards, we prune from $C_k$ any candidate $k$-itemexpset that
contains a sub-itemexpset with $(k - 1)$ itemexps that does not belong
to $L_{k-1}$. The next step is to create the tidset corresponding to every
candidate $k$-itemexpset in $C_k$. This can be done using the recursive
definition for tidsets:


$$tidset(X) = tidset(W \cup \{y\}) \cap tidset(W \cup \{z\}) \qquad (1)$$


After computing the tidsets, we compute the support of each candidate
$k$-itemexpset in $C_k$. Any candidate $k$-itemexpset having a
support $\geq minsup$ is added to $L_k$. Using the two pruning rules, we
remove some uninteresting itemexpsets from $L_k$ if necessary. After
pruning $L_k$, we have reached the end of the loop body. Hence, we
increment $k$ and repeat the main steps again if necessary. The main
loop stops running once $L_k$ is empty. Our Q-Eclat algorithm returns
$\bigcup_k L_k$, which contains all interesting frequent itemexpsets.


3.3 Our New Pruning Rules for Q-Eclat 
Algorithm 


As mentioned in Section 2.4, the MQA-M algorithm uses two pruning
rules to remove unnecessary itemexpsets from $L_k$, where integer
$k \geq 1$. Here, we present two new pruning rules that are more general
than the original pruning rules. Our pruning rules remove some
uninteresting itemexpsets that were not removed by the original pruning
rules. Let $X$ be an itemexpset in $L_k$. Then, our pruning rules are
described as follows:


(1) Suppose that $X$ contains an itemexp of the form $(z \leq r)$,
where $z$ is an item and $r \in \mathbb{Z}^+$. Our first pruning rule
states that, if there is another itemexpset $Y \in L_k$ that has the
same support as $X$ and is identical to $X$ except that $(z \leq r)$ is
replaced by $(z \leq r + s)$ for some $s \in \mathbb{Z}^+$, then $Y$ can
be pruned from $L_k$.

(2) Suppose that $X$ contains an itemexp of the form $(z \geq r)$,
where $z$ is an item and $r \in \mathbb{Z}^+$. Our second pruning rule
states that, if there is another itemexpset $Y \in L_k$ that has the
same support as $X$ and is identical to $X$ except that $(z \geq r)$ is
replaced by $(z \geq r - s)$ for some $s \in \mathbb{Z}^+$, then $Y$ can
be pruned from $L_k$.


A key difference between the original pruning rules and our new
pruning rules is that the new pruning rules can handle differences
in quantity greater than 1. Instead of considering only itemexps of
the form $(z \leq r + 1)$ or $(z \geq r - 1)$, we consider the more general
cases of $(z \leq r + s)$ or $(z \geq r - s)$ for some positive integer $s$.
As a result, these rules eliminate at least as many itemexpsets from
$L_k$ as the original pruning rules. As Example 8 shows, our
improved pruning rules are more powerful in removing redundant
frequent itemexpsets.


EXAMPLE 8. Suppose that $L_2$ contains two frequent 2-itemexpsets
$\{(a \geq 2), (b = 1)\}$ and $\{(a \geq 4), (b = 1)\}$ before pruning and
they have the same support. Using the original pruning rules used in
MQA-M, neither itemexpset would be pruned. In contrast, when using our
new pruning rules, $\{(a \geq 2), (b = 1)\}$ would be pruned from $L_2$.
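
The contrast drawn in Example 8 can be reproduced with the sketch below. It is one possible reading of the two rules, not the authors' code: the function name `prune`, the tuple encoding of itemexps, and the `max_step` parameter (where `max_step=1` mimics the original MQA-M rules and `None` allows any $s \in \mathbb{Z}^+$) are all assumptions made for illustration.

```python
def prune(support, max_step=None):
    """Return the itemexpsets that survive pruning.

    support maps each frequent itemexpset (a tuple of (item, op, q)
    triples) to its support. Y is dropped when some X in the same level
    has the same support and differs from Y only by a tighter bound on
    one ">=" or "<=" itemexp.
    """
    kept = {}
    for y, sup_y in support.items():
        redundant = False
        for pos, (item, op, r) in enumerate(y):
            if op == "=":
                continue  # the rules only concern >= and <=
            for x, sup_x in support.items():
                if len(x) != len(y) or sup_x != sup_y:
                    continue
                if x[:pos] != y[:pos] or x[pos + 1:] != y[pos + 1:]:
                    continue
                x_item, x_op, x_r = x[pos]
                # s measures how much tighter the bound of X is than that of Y.
                s = x_r - r if op == ">=" else r - x_r
                if (x_item, x_op) == (item, op) and s >= 1:
                    if max_step is None or s <= max_step:
                        redundant = True
        if not redundant:
            kept[y] = sup_y
    return kept


# Example 8: two 2-itemexpsets with the same (hypothetical) support of 1.
l2 = {
    (("a", ">=", 2), ("b", "=", 1)): 1,
    (("a", ">=", 4), ("b", "=", 1)): 1,
}
print(len(prune(l2, max_step=1)))  # 2
print(len(prune(l2)))              # 1
```

With `max_step=1` nothing is removed because no itemexpset containing $(a \geq 3)$ is present, whereas the generalized rules find $s = 2$ and drop $\{(a \geq 2), (b = 1)\}$.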


EXAMPLE 9. Suppose we set $minsup = 1$ and we have three transactions
in a quantitative transaction database:

• $t_1 = \{a : 2\}$,
• $t_2 = \{a : 4, b : 1\}$, and
• $t_3 = \{b : 1\}$.


Then, we observe the occurrences of each domain item in the
transactions and compute its itemmax:

• $itemmax[a] = 4$ because the highest number of occurrences of
$a$ in a transaction is 4 (in transaction $t_2$), and
• $itemmax[b] = 1$ because the highest number of occurrences of
$b$ in a transaction is 1 (in both transactions $t_2$ and $t_3$).


To generate $C_1$, we must generate every combination of an item,
comparison operator, and quantity. There are two items (i.e., $a$ and
$b$) and three operators (i.e., $=$, $\geq$, and $\leq$). For item $a$,
there are four quantity values (i.e., from 1 to $itemmax[a] = 4$). This
leads to a total of $1 \times 3 \times 4 = 12$ candidate itemexpsets from
item $a$. Similarly, the one quantity value (due to $itemmax[b] = 1$)
leads to a total of $1 \times 3 \times 1 = 3$ candidate itemexpsets from
item $b$. Consequently, this leads to a total of $12 + 3 = 15$ candidate
itemexpsets, as shown in the first column of Table 1.

For each itemexpset in $C_1$, we compute its corresponding tidset.
The support of each itemexpset is equal to the number of elements
(i.e., transaction IDs) in its tidset. We present the itemexpsets $X$ in
$C_1$, their tidsets, and their supports in Table 1.


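Table 1: Candidate and frequent 1-itemexpsets

$X \in C_1$          $tidset(X)$        $sup(X)$   $\geq minsup$   interesting
$\{(a = 1)\}$        $\emptyset$        0
$\{(a = 2)\}$        $\{t_1\}$          1          ✓               ✓
$\{(a = 3)\}$        $\emptyset$        0
$\{(a = 4)\}$        $\{t_2\}$          1          ✓               ✓
$\{(a \geq 1)\}$     $\{t_1, t_2\}$     2          ✓
$\{(a \geq 2)\}$     $\{t_1, t_2\}$     2          ✓               ✓
$\{(a \geq 3)\}$     $\{t_2\}$          1          ✓
$\{(a \geq 4)\}$     $\{t_2\}$          1          ✓               ✓
$\{(a \leq 1)\}$     $\emptyset$        0
$\{(a \leq 2)\}$     $\{t_1\}$          1          ✓               ✓
$\{(a \leq 3)\}$     $\{t_1\}$          1          ✓
$\{(a \leq 4)\}$     $\{t_1, t_2\}$     2          ✓               ✓
$\{(b = 1)\}$        $\{t_2, t_3\}$     2          ✓               ✓
$\{(b \geq 1)\}$     $\{t_2, t_3\}$     2          ✓               ✓
$\{(b \leq 1)\}$     $\{t_2, t_3\}$     2          ✓               ✓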


Among these 15 candidate 1-itemexpsets, only 12 of them satisfy
$minsup = 1$. We obtain the initial $L_1$ by keeping only these 12
candidate 1-itemexpsets having support $\geq minsup$. They are marked in
the fourth column of Table 1.

Then, by using our pruning rules described in Section 3.3, we further
prune away three more redundant 1-itemexpsets:

• Prune 1-itemexpset $\{(a \geq 1)\}$ by Pruning Rule (2) because both
$\{(a \geq 1)\} \in L_1$ and $\{(a \geq 2)\} \in L_1$ have the same support;
• Prune 1-itemexpset $\{(a \geq 3)\}$ by Pruning Rule (2) because both
$\{(a \geq 3)\} \in L_1$ and $\{(a \geq 4)\} \in L_1$ have the same support;
and
• Prune 1-itemexpset $\{(a \leq 3)\}$ by Pruning Rule (1) because both
$\{(a \leq 2)\} \in L_1$ and $\{(a \leq 3)\} \in L_1$ have the same support.

Hence, the final $L_1$ ends up with only $12 - 3 = 9$ frequent but not
redundant 1-itemexpsets:

• $\{(a = 2)\}$,
• $\{(a = 4)\}$,
• $\{(a \geq 2)\}$,
• $\{(a \geq 4)\}$,
• $\{(a \leq 2)\}$,
• $\{(a \leq 4)\}$,
• $\{(b = 1)\}$,
• $\{(b \geq 1)\}$,
• $\{(b \leq 1)\}$.

Afterwards, the main loop is executed with $k = 2$. We begin with
the generation of $C_2$. The first step in generating $C_2$ is to perform a
self-join on $L_1$. In this scenario, this means getting pairs of itemexps
where the itemexps refer to different items. This yields
$6 \times 3 = 18$ different candidate 2-itemexpsets in $C_2$, as shown in
Table 2.

Table 2: Candidate and frequent 2-itemexpsets

$X \in C_2$                    $tidset(X)$    $sup(X)$   $\geq minsup$   interesting
$\{(a = 2), (b = 1)\}$         $\emptyset$    0
$\{(a = 2), (b \geq 1)\}$      $\emptyset$    0
$\{(a = 2), (b \leq 1)\}$      $\emptyset$    0
$\{(a = 4), (b = 1)\}$         $\{t_2\}$      1          ✓               ✓
$\{(a = 4), (b \geq 1)\}$      $\{t_2\}$      1          ✓               ✓
$\{(a = 4), (b \leq 1)\}$      $\{t_2\}$      1          ✓               ✓
$\{(a \geq 2), (b = 1)\}$      $\{t_2\}$      1          ✓
$\{(a \geq 2), (b \geq 1)\}$   $\{t_2\}$      1          ✓
$\{(a \geq 2), (b \leq 1)\}$   $\{t_2\}$      1          ✓
$\{(a \geq 4), (b = 1)\}$      $\{t_2\}$      1          ✓               ✓
$\{(a \geq 4), (b \geq 1)\}$   $\{t_2\}$      1          ✓               ✓
$\{(a \geq 4), (b \leq 1)\}$   $\{t_2\}$      1          ✓               ✓
$\{(a \leq 2), (b = 1)\}$      $\emptyset$    0
$\{(a \leq 2), (b \geq 1)\}$   $\emptyset$    0
$\{(a \leq 2), (b \leq 1)\}$   $\emptyset$    0
$\{(a \leq 4), (b = 1)\}$      $\{t_2\}$      1          ✓               ✓
$\{(a \leq 4), (b \geq 1)\}$   $\{t_2\}$      1          ✓               ✓
$\{(a \leq 4), (b \leq 1)\}$   $\{t_2\}$      1          ✓               ✓

Among these 18 candidate 2-itemexpsets, only 12 of them satisfy
$minsup = 1$. We obtain the initial $L_2$ by keeping only these 12
candidate 2-itemexpsets having support $\geq minsup$. They are marked in
the fourth column of Table 2.

Note that none of these 12 itemexpsets in the initial $L_2$ can be
pruned by the original pruning rules used in the MQA-M algorithm.
For instance, recall from Example 8 that, although both
$\{(a \geq 2), (b = 1)\} \in L_2$ and $\{(a \geq 4), (b = 1)\} \in L_2$ have
the same support, $\{(a \geq 2), (b = 1)\}$ would not be pruned. However,
by using our pruning rules described in Section 3.3, we can further prune
away three more redundant 2-itemexpsets:

• Prune 2-itemexpset $\{(a \geq 2), (b = 1)\}$ by our new Pruning
Rule (2) because both $\{(a \geq 2), (b = 1)\} \in L_2$ and
$\{(a \geq 4), (b = 1)\} \in L_2$ have the same support;
• Prune 2-itemexpset $\{(a \geq 2), (b \geq 1)\}$ by our new Pruning
Rule (2) because both $\{(a \geq 2), (b \geq 1)\} \in L_2$ and
$\{(a \geq 4), (b \geq 1)\} \in L_2$ have the same support; and
• Prune 2-itemexpset $\{(a \geq 2), (b \leq 1)\}$ by our new Pruning
Rule (2) because both $\{(a \geq 2), (b \leq 1)\} \in L_2$ and
$\{(a \geq 4), (b \leq 1)\} \in L_2$ have the same support.

Hence, the final $L_2$ ends up with only $12 - 3 = 9$ frequent but not
redundant 2-itemexpsets:


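• $\{(a = 4), (b = 1)\}$,
• $\{(a = 4), (b \geq 1)\}$,
• $\{(a = 4), (b \leq 1)\}$,
• $\{(a \geq 4), (b = 1)\}$,
• $\{(a \geq 4), (b \geq 1)\}$,
• $\{(a \geq 4), (b \leq 1)\}$,
• $\{(a \leq 4), (b = 1)\}$,
• $\{(a \leq 4), (b \geq 1)\}$,
• $\{(a \leq 4), (b \leq 1)\}$.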

With only two items a and b, no candidate 3-itemexpsets can be 
formed. Consequently, our Q-Eclat returns the nine frequent but not re- 
dundant 1-itemexpsets and the nine other frequent but not redundant 
2-itemexpsets, for a total of 18 itemexpsets as the output. 
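
The whole of Example 9 can be replayed with the compact sketch below, which chains the vertical conversion, the generation of $C_1$, the tidset intersections of Equation (1), and the generalized pruning rules. It is an illustrative reading of Sections 3.1 to 3.3 under stated assumptions, not the authors' implementation: the names `q_eclat` and `prune`, the `{tid: {item: quantity}}` input format, and the tuple encoding of itemexps are all hypothetical.

```python
import operator
from itertools import combinations

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def prune(frequent):
    """Apply the generalized pruning rules to {itemexpset: tidset}."""
    kept = {}
    for y, tids_y in frequent.items():
        redundant = False
        for pos, (item, op, r) in enumerate(y):
            if op == "=":
                continue
            for x, tids_x in frequent.items():
                same_elsewhere = x[:pos] == y[:pos] and x[pos + 1:] == y[pos + 1:]
                if not same_elsewhere or len(tids_x) != len(tids_y):
                    continue
                x_item, x_op, x_r = x[pos]
                tighter = x_r > r if op == ">=" else x_r < r
                if (x_item, x_op) == (item, op) and tighter:
                    redundant = True
        if not redundant:
            kept[y] = tids_y
    return kept


def q_eclat(transactions, minsup):
    """Return {itemexpset: support} for all interesting frequent itemexpsets."""
    # Vertical format: item -> set of (tid, quantity) pairs.
    pairsets = {}
    for tid, transaction in transactions.items():
        for item, quantity in transaction.items():
            pairsets.setdefault(item, set()).add((tid, quantity))

    # C_1, with each tidset read directly off the pairsets.
    candidates = {}
    for item, pairs in pairsets.items():
        itemmax = max(quantity for _, quantity in pairs)
        for op, holds in OPS.items():
            for q in range(1, itemmax + 1):
                candidates[((item, op, q),)] = {t for t, n in pairs if holds(n, q)}

    result = {}
    while candidates:
        frequent = {x: tids for x, tids in candidates.items() if len(tids) >= minsup}
        level = prune(frequent)  # final L_k
        result.update(level)
        candidates = {}          # becomes C_{k+1}
        for x, y in combinations(sorted(level), 2):
            if x[:-1] != y[:-1] or x[-1][0] == y[-1][0]:
                continue
            joined = x + (y[-1],)
            subsets = (joined[:i] + joined[i + 1:] for i in range(len(joined)))
            if all(subset in level for subset in subsets):
                candidates[joined] = level[x] & level[y]  # Equation (1)
    return {x: len(tids) for x, tids in result.items()}


database = {"t1": {"a": 2}, "t2": {"a": 4, "b": 1}, "t3": {"b": 1}}
patterns = q_eclat(database, minsup=1)
print(len(patterns))  # 18
```

The 18 returned itemexpsets are the nine 1-itemexpsets and nine 2-itemexpsets listed above; the loop ends because no two 2-itemexpsets can be joined without repeating an item.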


4 EVALUATION 


To evaluate our Q-Eclat algorithm, we compared it with the exist- 
ing MQA-M algorithm [65]. The performance of the algorithms is 
assessed using four different quantitative transaction databases: 

• two synthetic datasets: Here, we assume that there are $n$
transactions and $|I|$ different items. Each item has a probability
$prob$ of occurring in a particular transaction, where
$0 < prob < 1$. If the item appears in the transaction, then
the number of occurrences of that item follows a Poisson($\lambda$)
distribution plus 1. We set $n = 1000$, $|I| = 50$, and $\lambda = 1$.
The values of $prob$ for these two quantitative transaction
databases are 0.2 and 0.8. These quantitative transaction
databases are considered sparse and dense, respectively.

• two real-life datasets from the UCI ML Repository [67]: Here,
we modified the chess and mushroom datasets to make them
quantitative transaction databases. Whenever an item occurs
in a transaction, instead of it only occurring once, its number
of occurrences follows a Poisson($\lambda$) distribution plus 1.

Figure 1: Runtimes of the existing MQA-M algorithm and our Q-Eclat algorithm for quantitative frequent pattern mining with
various minsup values on synthetic datasets: (a) sparse and (b) dense synthetic datasets.


More specifically, these two synthetic datasets and two real-life 
datasets are: 


1) sparse synthetic dataset, with prob = 0.2; 
2) dense synthetic dataset, with prob = 0.8; 
3) modified real-life chess dataset; and 

4) modified real-life mushroom dataset. 
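
The synthetic generation procedure described above can be sketched as follows. The paper does not state how the Poisson variates were drawn, so the sampler (Knuth's multiplication method), the seed handling, and the names `poisson` and `synthetic_database` are assumptions made to keep the sketch within the Python standard library.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def synthetic_database(n=1000, num_items=50, prob=0.2, lam=1.0, seed=0):
    """Generate n transactions as {item: quantity} dictionaries.

    Each item occurs with probability prob; when it occurs, its quantity
    is a Poisson(lam) draw plus 1, so every quantity is at least 1.
    """
    rng = random.Random(seed)
    database = []
    for _ in range(n):
        transaction = {}
        for item in range(num_items):
            if rng.random() < prob:
                transaction[item] = poisson(rng, lam) + 1
        database.append(transaction)
    return database


sparse = synthetic_database(prob=0.2)  # sparse setting
dense = synthetic_database(prob=0.8)   # dense setting
```

The same "Poisson draw plus 1" step could be applied to each item occurrence of the chess and mushroom datasets to obtain their quantitative versions.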


The two algorithms for quantitative frequent itemset mining
(i.e., MQA-M [65] and our Q-Eclat) were implemented in Python.
The algorithms were run on a Windows 10 Nitro AN515-55 laptop with
an Intel Core i5-10300H CPU at 2.50 GHz and 8.00 GB of RAM. To keep
the comparison between the algorithms fair, the two implementations
share many of the same functions. These include the generation of
candidate itemexpsets, the discovery of frequent itemexpsets, and the
application of our pruning rules to the frequent itemexpsets. In our
implementation of the MQA-M algorithm, we use the improved pruning
rules of Q-Eclat rather than the pruning rules originally used with
MQA-M. This allows the experiments to emphasize the remaining
differences between the algorithms.

Figure 2: Runtimes of the existing MQA-M algorithm and our Q-Eclat algorithm for quantitative frequent pattern mining with
various minsup values on real-life datasets: (a) chess and (b) mushroom datasets.

We ran the main code on each of the aforementioned quantitative
transaction databases. For each quantitative transaction database,
we used a sequence of $minsup$ values. The sequence depends on the
database and was chosen so that the results are interesting while the
algorithms do not take too long to run. For each combination of a
quantitative transaction database and a value of $minsup$, the MQA-M
and Q-Eclat algorithms were run and timed. The reported runtimes are
averages over multiple runs.
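
A timing harness of the following kind would produce such averaged runtimes. The paper does not give the number of repetitions or the timer used, so `repeats=5`, `time.perf_counter`, and the function name `average_runtime` are assumptions, and `algorithm` stands for any callable taking a database and a $minsup$ value.

```python
import time


def average_runtime(algorithm, database, minsup, repeats=5):
    """Average wall-clock time (in seconds) of algorithm(database, minsup)."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        algorithm(database, minsup)
        total += time.perf_counter() - start
    return total / repeats


# Hypothetical stand-in for MQA-M or Q-Eclat, used only to show the call.
def dummy_miner(database, minsup):
    return [t for t in database if len(t) >= minsup]


for minsup in (1, 2, 3):
    seconds = average_runtime(dummy_miner, [{"a": 1}, {"a": 2, "b": 1}], minsup)
    print(minsup, seconds)
```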

Figs. 1 and 2 show the runtimes of the two algorithms for a variety
of $minsup$ values on each of the four quantitative transaction
databases. The runtime (in seconds) is shown on the y-axis, while the
value of $minsup$ is given on the x-axis. In all cases, our Q-Eclat ran
faster than the existing MQA-M algorithm while returning the same
collections of itemexpsets.


5 CONCLUSIONS 


In this paper, we presented our vertical quantitative frequent itemset
mining algorithm called Q-Eclat. This Q-Eclat algorithm first represents
the big data as a collection of sets of transaction IDs (i.e., tidsets).
Each item-centric tidset captures the IDs of transactions containing
the item, as well as the quantity of that item in each transaction.
With this representation, our algorithm then vertically mines
quantitative frequent patterns. During the mining process, our new
pruning rules reduce the mining space, and thus shorten the runtime.
When compared with the existing MQA-M algorithm (which was
built for quantitative frequent pattern mining), evaluation results
show that our vertical Q-Eclat algorithm takes less time to mine
quantitative frequent patterns. As ongoing and future work, we are
exploring ways to further enhance the mining of quantitative frequent
patterns and to extend this work to mine other quantitative patterns.
ABSTRACT


Social networks are increasingly becoming a source of wealth for
people to connect with others in society and express themselves.
These networks store huge amounts of data related to individual
and collective behavior and relationships. Despite their importance,
little research explains the factors leading to the evolution of these
relationships, as well as abrupt changes in the behavior of individuals
in contact. This paper proposes an approach based on the topology of
social networks to detect early warnings of such changes, called weak
signals. Our approach contrasts with existing works that focus on
analyzing major themes and trends, i.e., strong signals, prevalent in a
social network at a particular point in time. We rely on a temporal
interaction graph and extract patterns that characterize weak signals.
We demonstrate our approach and validate the detected signals through
the analysis of social interactions between individuals of a captive
Guinea baboon group, and confirm the existence of weak signals prior to
the occurrence of aggressive behavior.
1 INTRODUCTION


The tremendous increase in the diversity of available data gives
businesses, governments, and stakeholders perspectives on what
is likely to be important or not, which allows them to build future
strategies for decision-making. To build such strategies for better
decision-making, the exponentially growing volume of data should be
analyzed using automated and systematic methods. For example, in a
business environment, using these methods gives institutions a
better view of their customers’ opinions, enabling them to better
develop their image or brand.

Analyzing social network data, i.e., the relationships between
entities, can provide insight and awareness that is absent
when considering the entities alone. Social network data are created
from interactions between individuals, interactions amplified by
personal relationships. These networks have interesting characteristics
in terms of the data values. However, they also have specific
properties (power law distribution, small world, assortativity,
preferential attachment, etc.) that require more sophisticated analysis
tools than classical information systems approaches offer.
For example, the community structure of social networks is one 
of the fundamental properties. The structure of these networks 
can be used to understand the interactions between entities but 
also to explain events. Indeed, observations have shown that some 
events emerge faster through social networks than through other 
traditional media including Websites, radio and television [31]. 

Recently, the analysis of social networks has focused on predictions
and on the future effects of current strategies, which are becoming a
common interest of marketing, sales, and competitive intelligence
analysts [10, 20]. Professionals dedicated to these strategies are able
to analyse data that arise on social platforms to capture early signs of
changes that might present opportunities, such as awareness and
engagement, but also threats to the evolution of the environment [22].
Due to the growing volume of these data, we are sometimes unable to see
small but significant clues that act as warnings of important events to
come. Finding and capturing these clues is even harder if the events we
are interested in are not known beforehand. Consequently, an adapted and
automated system is required to exploit and process these data.
Detecting early signs of change, often known as weak signals, allows
policy and decision-makers to adopt anticipatory and more effective
action strategies, rather than responding on the spot to events as they
happen. Hence the need for a method that relies on a graph of
interactions between entities.

In this paper, we present an approach for identifying and interpreting
weak signals. We propose a method that relies on the network topology,
by extracting particular network patterns, so-called graphlets, that
we consider an operational description of the weak signal. We propose
to model the social data in the form of a temporal interaction
graph between entities, and to extract these patterns, which could be
indicators of important events in the future.

The rest of the paper is structured as follows: Section 2 presents
some definitions of weak signals from the literature, and methods
used to detect them in large data sets. In Section 3, we introduce
the proposed approach to identify and validate weak signals in
social networks: we present a case study carried out on a graph of
temporal interactions between individuals. Section 4 describes an
experimental setup to measure our algorithm’s performance, then
outlines the architecture of our approach, consisting of different
layers to support data processing and exploration. Finally, Section
5 concludes the work presented in this paper and discusses future
directions and perspectives.


2 BACKGROUND 


Research on weak signals is largely influenced by the work of Ansoff
in the 1970s [3]. He introduced a theoretical definition of the
weak signal by considering it as a first symptom of strategic
discontinuities, acting as early warning information of weak intensity
that can announce a trend or an important event. In [4], Ansoff
completes the signal’s definition: a weak signal is a sudden, urgent,
unfamiliar change in the firm’s perspective which threatens either a
major profit reversal or the loss of a major opportunity. Depending on
the authors and the domains, synonyms such as hunch and alarm signal
have been proposed, as well as adjectives associated with the signal,
like early [6], critical [9], vague, etc. Indeed, weak signals can be
revealed in many domains, from the detection of anomalies in a
complex system like airplane management [1], to the protection of
individuals such as the prevention of crimes [7] or harassment [18],
but also in decision making and anticipation within the framework
of the strategic planning of companies by using a Return on Experience
(REX). By summarizing the definitions proposed in the literature, we
can define a weak signal as a piece of information that provides an
indication of upcoming or emerging events that may have a significant
impact on the system.


Hiltunen’s three-dimensional model [12] was proposed in the
late 2000s. The author introduced the importance of a signal and
the graduation of a signal from weak to strong, and also put forward
the rareness of the signal. By highlighting the characteristics of weak
signals, this model made it possible to provide an operational
definition of the concept. We rely on this model in our approach. We
translate the first two dimensions into the support of the signal, which
is a graph built from the temporal interactions, and a
phenomenon/cause¹, which is an announcer of the event. As for the third
dimension, interpretation, we rely on the expertise of decision-makers
to make sense of the detected signals. Figure 1, inspired by Hiltunen’s
model, illustrates our perspective on these three dimensions.

In order to detect weak signals efficiently, Ansoff [4] suggested
that they must pass three filters: 1) the monitoring (surveillance);
2) the mentality; and 3) the power, before potentially triggering an
action or a decision. The monitoring filter relates to the capacity
of one or more actors within the organization to identify the weak
signal in the midst of all other perceived information. The
mentality filter refers to the capacity to recognize the signal once
detected. Finally, the power filter refers to decision-making once
the signal is detected and its relevance recognized. The people in
charge in the organization can decide, for example, not to make this
signal a priority, despite the underlying risk [29].


¹A phenomenon is an observed fact, a normal or surprising event.

Figure 1: Signal strengthening inspired by Hiltunen’s three-
dimensional model

A survey [28] presented a theoretical background on the most widely
used methods and applications in the domain of weak signal detection.
It should be noted that none of these methods offers business experts
specific tools for interpreting weak signals. These methods can be
classified into several categories, such as statistics [14], graph
theory [19], and machine learning [23]. The vast majority of these
methods rely on keyword and document analysis to identify some of them
as weak signals, using text mining [17, 21, 30] and speech recognition.
When dealing with data from social networks, applying existing weak
signal detection methods based on, for example, text mining or topic
modeling is not an easy task. The data from social platforms such as
Facebook or Twitter consist of short texts (at most 280 characters on
Twitter) containing abbreviations, spelling errors, special characters,
URLs, images, etc. In the following, we describe our proposed approach
to detect and interpret weak signals in a temporal graph of interactions.


3 DETECTING WEAK SIGNALS USING
NETWORK TOPOLOGY: A CASE STUDY
WITH SOCIAL INTERACTIONS BETWEEN
BABOONS


We rely on a topological analysis of social relations between entities
to extract patterns characterizing weak signals. We choose special
network motifs, graphlets (first introduced in 2004 [26]), as an
operational description to detect weak signals in a large graph of
temporal interactions. Graphlets are connected, non-isomorphic³ induced
subgraphs² of 2 to 5 nodes chosen among the nodes of a large graph.
There are 30 different types, from $G_0$ to $G_{29}$⁴. An essential
element in the context of graphlets is the orbit [25]. Orbits represent
the positions (or roles) occupied by the nodes of these subgraphs.
There are 73 different positions (from $O_0$ to $O_{72}$) for the 30
graphlets. Graphlets and their corresponding orbits are illustrated in
Figure 2.


²In graph theory, an induced subgraph consists of a subset of the nodes
of the original graph together with all the edges of the original graph
that connect pairs of nodes in that subset.

³In graph theory, an isomorphism of two graphs G and H is a one-to-one
correspondence between the sets of nodes of G and H, such that two nodes
are adjacent in G if and only if the corresponding nodes are adjacent
in H. Graphlets are non-isomorphic because they do not have the same shape.

⁴In this document, we use the term graphlet for each type among the 30;
it does not denote an occurrence of that type.


[Figure 2 image: the 2-node, 3-node, 4-node, and 5-node graphlets $G_0$ to $G_{29}$]

Figure 2: 30 different graphlet types with their orbits, as 
introduced in [26]. 
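
To make the notion of counting graphlet occurrences concrete, the sketch below counts only the three smallest types by brute force over induced subgraphs: $G_0$ (an edge), $G_1$ (a path on three nodes), and $G_2$ (a triangle), following the usual numbering of [26]. It is not the Orca algorithm used later in this paper, and the function name and input format are assumptions. It also assumes an undirected graph without self-loops.

```python
from itertools import combinations


def count_small_graphlets(edges):
    """Count induced occurrences of G0 (edge), G1 (3-node path), G2 (triangle)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    counts = {"G0": 0, "G1": 0, "G2": 0}
    # Each undirected edge appears in two adjacency sets.
    counts["G0"] = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    for trio in combinations(sorted(adjacency), 3):
        links = sum(1 for u, v in combinations(trio, 2) if v in adjacency[u])
        if links == 2:
            counts["G1"] += 1  # connected, but the two end nodes are not linked
        elif links == 3:
            counts["G2"] += 1
    return counts


# A triangle A-B-C with a pendant node D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(count_small_graphlets(edges))  # {'G0': 4, 'G1': 2, 'G2': 1}
```

Enumerating every node triple is cubic in the number of nodes, which is acceptable for a sketch but is the kind of cost that a dedicated enumeration algorithm avoids for graphlets of up to 5 nodes.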


Indeed, we choose graphlets because they present characteristics
generally associated with weak signals:


• They are small patterns consisting of few links between
nodes;

• Some of them are rare in a large volume of information;

• They are nevertheless interpretable by business experts by
means of their predefined shapes and orbits.


We aim to find a quantifiable property of weak signals based on
graphlets, and describe a case study performed on a network
representing a ground truth in a social network. The objective of
this case study is to use the collected data to identify and report
indicators of changes in the behavior of individuals in contact,
particularly those leading to an aggression. This identification can be
seen as an opportunity to monitor the evolution of a social
group over time, and possibly prevent aggression between individuals
of the group. We provide additional analysis components that
help stakeholders and experts interpret and give meaning
to the identified indicators, which mitigates the "black box" effect
that a fully automated approach could have. To this end, we start
by describing the data, then apply our method of detecting weak signals,
and finally validate the resulting weak signals.


3.1 Data presentation 


The raw data set used in this study represents a list of interactions
between individuals belonging to a group of captive Guinea baboons [8].
We downloaded the corresponding raw files from the SocioPatterns
website, which offers free online data sets to the scientific
community⁵. The data span a time window of nearly a month
between June and July 2019 and were collected by two different
methods: 1) behavioral observations by trained human observers;
and 2) a wearable sensor-based infrastructure.

To describe the latter infrastructure, the group of 20 baboons was
fitted with leather collars. Two individuals are considered to be in
contact during a 20-second interval if their sensors have exchanged
at least one packet during this interval, and the contact event is
terminated when the sensors do not exchange any packets during
a 20-second interval. We consider the data from this infrastructure as
the input to our detection of weak signals. It consists of 63,095
interactions in the form of a three-component tuple $(t, i, j)$, as shown
in the extract in Table 1: $t$ represents the timestamp⁶ at which the
interaction took place, and $i$ and $j$ are the names of the individuals
in contact.


⁵http://www.sociopatterns.org/datasets/baboons-interactions/


IDEAS’22, August 22-24, 2022, Budapest, Hungary


t            i         j
1560396500   ARIELLE   FANA
1560396500   ARIELLE   VIOLETTE
1560396520   FANA      HARLEM
1560396540   FELIPE    ANGELE
1560396540   ARIELLE   FANA
1560396580   BOBO      FELIPE


Table 1: Extract from the raw file of baboon interactions,
collected by the sensors.
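
As a sketch of how such $(t, i, j)$ tuples can be grouped into the fixed-duration snapshots that Section 3.2 works with, the snippet below buckets interactions into windows of `delta_t` seconds (1800 seconds for 30-minute snapshots). Aligning the windows to the first timestamp, skipping empty windows, and the function name `snapshots` are assumptions, not details given in the paper.

```python
from collections import defaultdict


def snapshots(interactions, delta_t=1800):
    """Group (t, i, j) tuples into consecutive windows of delta_t seconds.

    Each snapshot is returned as a set of undirected edges (frozensets of
    the two individuals). Windows start at the earliest timestamp, and
    windows without any interaction are skipped.
    """
    if not interactions:
        return []
    ordered = sorted(interactions)
    origin = ordered[0][0]
    windows = defaultdict(set)
    for t, i, j in ordered:
        windows[(t - origin) // delta_t].add(frozenset((i, j)))
    return [windows[index] for index in sorted(windows)]


# The six rows of Table 1.
rows = [
    (1560396500, "ARIELLE", "FANA"),
    (1560396500, "ARIELLE", "VIOLETTE"),
    (1560396520, "FANA", "HARLEM"),
    (1560396540, "FELIPE", "ANGELE"),
    (1560396540, "ARIELLE", "FANA"),
    (1560396580, "BOBO", "FELIPE"),
]
print(len(snapshots(rows)))              # 1
print(len(snapshots(rows, delta_t=40)))  # 3
```

With 30-minute windows all six rows fall into one snapshot containing five distinct edges, since the ARIELLE and FANA contact appears twice.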


We are interested in finding weak signals indicating a change in
the behavior of individuals in contact, especially those leading to
an aggression. Searching for weak signals in this context offers
insight into possible future situations that might threaten
the growth and the development of the society.


3.2 Identification of weak signals 


Our aim is to describe weak signals with a signature in the form of 
a quantifiable property that characterizes them and helps with their 
detection amidst a large volume of data. We therefore use graphlets 
as an operational tool to establish this signature, characterized 
by the signal’s visibility, diffusion, amplification and rareness. Ta- 
ble 2 provides a description of weak signals characteristics from 
conceptual and operational perspectives. 

Since we are dealing with temporal interactions, we order the
relations by t and divide the original corpus into s snapshots in order
to study the diffusion and amplification of signals. A snapshot $S^i$
contains the nodes (baboons) and the relationships between them that
occurred during the time interval $[i, i + \Delta t[$, where $\Delta t$,
the duration of each snapshot, is equal to 30 minutes. We applied the
algorithms described below to each snapshot. Algorithm 1 presents the
steps followed in our approach, using graphlets for the identification
of weak signals. First, the sets candidates (event precursors) and WS
(weak signals) are initialized to store the results, along with the
30-element arrays that hold the velocities and accelerations in
snapshot $S^i$ (lines 2 and 3). For each snapshot $S^i$, each graphlet
type $G_x$, $\forall x \in \{0, \ldots, 29\}$, is enumerated using the
Orca algorithm⁷ [13]. We chose Orca because it provides the exact
enumeration of graphlets and orbits with an acceptable complexity.
The result of Orca is stored in an array of 30 elements $G^i$, where
$G^i[x]$ contains the number of graphlets $G_x$ in the snapshot $S^i$
(line 4). The obtained values are then normalized using a procedure
inspired by the work in [11], which studies the similarity between two
queries in

⁶ Timestamp epoch: number of seconds elapsed since January 1, 1970 at 00:00; e.g. "13
June 2019 03:28:20" corresponds to timestamp 1560396500.

⁷ https://rdrr.io/github/alan-turing-institute/network-comparison/src/R/orca_interface.R
Dimension/Criterion  Conceptual definition  Operational definition
Visibility           Number/frequency       Number of graphlets in each snapshot
Diffusion            Velocity               Velocity of graphlets calculated w.r.t. their number
Amplification        Acceleration           Acceleration of graphlets calculated w.r.t. their velocity
Rareness             Contribution           Contribution of each graphlet w.r.t. the number of all
                                            graphlets (ratio calculation)


Table 2: Conceptual and operational descriptions of weak signal criteria.


Algorithm 1: Weak signals identification in a snapshot $S^i$

Inputs : $S^i$ the i-th snapshot, $G^{i-1}$ normalized numbers of
         graphlets in snapshot $S^{i-1}$, $V^{i-1}$ normalized
         velocities in snapshot $S^{i-1}$, real k
Output : Detected weak signals WS
1  begin
2      WS $\leftarrow$ {}; candidates $\leftarrow$ {};
3      $V^i \leftarrow$ [NULL]; $A^i \leftarrow$ [NULL];
4      $G^i \leftarrow$ Orca($S^i$, 5);   /* Count 5-nodes graphlets. */
5      for x $\leftarrow$ 0 to 29, i.e. for each type of graphlet $G_x$, do
6          $G^i[x] \leftarrow$ Normalization(x, i, $G^i$);
7          $V^i[x] \leftarrow G^i[x] - G^{i-1}[x]$;   /* Velocity calculation for graphlet type $G_x$ */
8          $A^i[x] \leftarrow V^i[x] - V^{i-1}[x]$;   /* Acceleration calculation for graphlet type $G_x$ */
9          if $V^i[x] > k$ Or $A^i[x] > k$ then
10             candidates $\leftarrow$ candidates $\cup$ $G_x$;
11         end if
12     end for
13     WS $\leftarrow$ Qualification($G^i$, candidates, k);
14     return WS
15 end


a temporal database (see Algorithm 2 for details of the function).
Even though the snapshots have the same duration, we normalize in
order to rescale the graphlet counts, since the number of nodes and
of their links differs between snapshots (some snapshots contain only
a few links, while others contain up to thousands).
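The division of the corpus into 30-minute snapshots can be sketched as follows. This is only an illustration under stated assumptions: the first snapshot is taken to start at the earliest timestamp (the text does not say where the first interval begins), repeated contacts of the same pair inside a snapshot are collapsed into one undirected link, and `split_into_snapshots` is a hypothetical name. The third sample tuple is made up to show a second snapshot.

```python
from collections import defaultdict

SNAPSHOT_SECONDS = 30 * 60  # duration of one snapshot: 30 minutes, as in the text


def split_into_snapshots(interactions, delta=SNAPSHOT_SECONDS):
    """Group (t, i, j) tuples into half-open intervals of `delta` seconds.

    Returns {snapshot index: set of undirected links}, where snapshot n
    covers [t0 + n*delta, t0 + (n+1)*delta) and t0 is the earliest
    timestamp (an assumption of this sketch).
    """
    if not interactions:
        return {}
    ordered = sorted(interactions)  # order the relations by t
    t0 = ordered[0][0]
    snapshots = defaultdict(set)
    for t, i, j in ordered:
        index = (t - t0) // delta
        snapshots[index].add(frozenset((i, j)))  # undirected, deduplicated
    return dict(snapshots)


sample = [
    (1560396500, "ARIELLE", "FANA"),
    (1560396540, "ARIELLE", "FANA"),         # same pair, same snapshot
    (1560396500 + 1800, "BOBO", "FELIPE"),   # hypothetical later contact
]
snapshots = split_into_snapshots(sample)
```

Each snapshot's link set can then be handed to a graphlet counter; that step is not shown here.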

From the normalized values, we calculate the velocities and
accelerations of the graphlets, which we use as quantifiable measures
of the diffusion and amplification of signals. Based on these criteria,
graphlets whose velocity or acceleration exceeds a predefined entry
threshold k are selected as candidates for weak signals (lines 6 to
12). To qualify the selected graphlets as weak signals, we consider
the rareness criterion. The aim here is to quantify, by means of a
ratio, the contribution of each graphlet to the overall evolution of
all graphlets, in order to confirm whether or not it is a weak signal
(line 13). The Qualification function returns the list of graphlets
identified as weak signals.
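The candidate selection described above (velocity and acceleration compared with the threshold k) can be sketched in Python. This is a hedged illustration rather than the authors' implementation: it assumes the normalized counts of the current and previous snapshots and the previous velocities are already available as plain lists indexed by graphlet type, and `select_candidates` is a hypothetical name.

```python
def select_candidates(g_curr, g_prev, v_prev, k):
    """Return (velocities, accelerations, candidates) for one snapshot.

    g_curr, g_prev: normalized graphlet counts of snapshots i and i-1.
    v_prev: velocities computed for snapshot i-1.
    k: entry threshold; a graphlet type is a candidate when its velocity
       or its acceleration is greater than k.
    """
    velocities, accelerations, candidates = [], [], set()
    for x, (now, before) in enumerate(zip(g_curr, g_prev)):
        velocity = now - before               # difference of normalized counts
        acceleration = velocity - v_prev[x]   # difference of velocities
        velocities.append(velocity)
        accelerations.append(acceleration)
        if velocity > k or acceleration > k:
            candidates.add(x)
    return velocities, accelerations, candidates


# Toy values for three graphlet types (not taken from the baboon data).
v, a, cands = select_candidates(
    g_curr=[1.0, 0.2, 0.5],
    g_prev=[0.1, 0.1, 0.9],
    v_prev=[0.0, 0.0, -1.5],
    k=0.5,
)
print(sorted(cands))  # [0, 2]
```

Type 0 is selected because of its velocity, and type 2 because of its acceleration even though its velocity is negative.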
Algorithm 2: Normalization function

Inputs : Type of graphlet $G_x$, graphlet numbers $G^i$ at all
         snapshots, number of all snapshots s
Output : Normalized value

1 begin
2     Calculate the mean of graphlet $G_x$ for s snapshots: $\mu(G_x)$
3     Calculate the standard deviation of graphlet $G_x$: $\sigma(G_x)$
4     Res $\leftarrow \dfrac{G^i[x] - \mu(G_x)}{\sigma(G_x)}$;   /* Normalization */
5     return Res
6 end
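Algorithm 2 amounts to a standard score computed per graphlet type across all snapshots. A minimal Python sketch follows; the use of the population standard deviation and the handling of a constant series (returning 0.0) are assumptions of this sketch, since the paper does not specify them, and `normalize` is a hypothetical name.

```python
from statistics import mean, pstdev


def normalize(x, i, counts):
    """Standard score of graphlet type x in snapshot i.

    counts[s][x] is the raw number of graphlets of type x in snapshot s.
    """
    series = [snapshot[x] for snapshot in counts]
    mu = mean(series)
    sigma = pstdev(series)  # population standard deviation (assumption)
    if sigma == 0:
        return 0.0          # constant series: no deviation to report (assumption)
    return (counts[i][x] - mu) / sigma


# Toy counts for a single graphlet type over three snapshots.
toy_counts = [[2], [4], [6]]
print(round(normalize(0, 2, toy_counts), 4))  # 1.2247
```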


Algorithm 3: Qualification function

Inputs : Graphlet number $G^i$ at snapshot $S^i$, candidates, real k
Output : List of weak signals
1  begin
2      Res $\leftarrow$ {};   /* Initialization */
3      for x $\leftarrow$ 0 to 29, i.e. for each type of graphlet $G_x$, do
4          $R[x] \leftarrow \dfrac{G^i[x]}{\sum_{x=0}^{29} G^i[x]}$;   /* Contribution ratio calculation */
5      end for
6      while Rank($R[x]$) $\leq k$ do
7          $arr_R \leftarrow arr_R \cup G_x$;   /* Choose top k contributions */
8      end while
9      if $G_x \in arr_R$ And $G_x \in$ candidates then
10         Res $\leftarrow$ Res $\cup$ $G_x$   /* True positives */
11     else if $G_x \in arr_R$ And $G_x \notin$ candidates then
12         Res $\leftarrow$ Res $\cup$ $G_x$   /* True negatives */
13     end if
14     return Res
15 end


Algorithm 3 describes the details of the qualification function. For
each graphlet type, we calculate the contribution ratio at a snapshot
$S^i$ (lines 3 to 5) and rank the resulting values in ascending order.
The contribution values whose rank is less than or equal to k are
chosen and stored in a list (line 7). Next, we apply the following
checking rules, which aim to retain the true positives and true
negatives to be qualified as weak signals. If the graphlet is a
candidate and its ratio is among the lowest contributions, it is
classified as a true positive and stored in the weak signals list. If
the graphlet is a candidate but its ratio is not among the lowest
contributions, it is classified as a false positive, or false alarm.
If the graphlet is not a candidate but its ratio is among the lowest
contributions, it is classified as a true negative and also stored in
the weak signals list. Finally, if the graphlet is neither a candidate
nor among the lowest contributions, it is classified as a false
negative. At the end of this step, we keep the true positives, add the
true negatives, and eliminate the false alarms, so the algorithm
returns only the list of true positives and true negatives, stored in
the weak signals list (lines 9 to 13). Figure 3 summarizes the
criteria and algorithms used to identify weak signals in a snapshot
$S^i$.
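The qualification rules can be sketched as follows. This is one possible reading, not the authors' code: the cutoff is interpreted as a number of lowest-ranked contributions (called `top_k` here to keep it apart from the velocity threshold), graphlet types that do not occur in the snapshot are skipped, and `qualify` is a hypothetical name.

```python
def qualify(counts, candidates, top_k):
    """Split the top_k lowest contributions into true positives/negatives.

    counts: number of graphlets of each type in the snapshot.
    candidates: graphlet types selected by velocity or acceleration.
    Returns (true_positives, true_negatives); their union is the list of
    weak signals under the rules described in the text.
    """
    total = sum(counts)
    if total == 0:
        return set(), set()
    # Contribution ratio of each graphlet type that occurs at least once
    # (skipping absent types is an assumption of this sketch).
    ratios = {x: c / total for x, c in enumerate(counts) if c > 0}
    lowest = set(sorted(ratios, key=ratios.get)[:top_k])  # ascending order
    true_positives = lowest & set(candidates)
    true_negatives = lowest - set(candidates)
    return true_positives, true_negatives


# Toy counts for five graphlet types; types 1 and 4 are candidates.
tp, tn = qualify(counts=[50, 2, 0, 5, 43], candidates={1, 4}, top_k=2)
print(tp, tn)  # {1} {3}
```

In this toy case, candidate 4 is a false alarm because its contribution (0.43) is not among the two lowest, while type 3 is kept as a true negative.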


[Figure 3 (diagram): for snapshot $S^i$ with graphlet counts
$G^i = (G^i[0], G^i[1], \ldots, G^i[29])$, Algorithms 1 & 2 apply the
diffusion and amplification criteria to select the candidates;
Algorithm 3 then yields the true positives and the true negatives.]


Figure 3: Summary of the steps and algorithms used for the
identification of weak signals at snapshot $S^i$.


After applying these algorithms to the baboons data set, and
given the length constraints of the paper, we present in table 3 the
study carried out on the single 08:00 a.m. snapshot of 19-06-2019.


Graphlet      Gos     Gi4     Gaz     G7      Gi
Shape         (graphlet drawings omitted)
Contribution  0.0198  0.0246  0.0324  0.0360  0.0381


Table 3: Top 5 graphlets qualified as weak signals in the 8:00
a.m. snapshot.


3.3 Validation of identified signals 


We use the observations recorded by a human observer to validate the
weak signals identified in the preceding subsection. Behavioral
observations were conducted 5 days per week (between June 13 and
July 10, 2019) using the focal sampling method [2], with two sessions
of approximately two hours per day at different times each day,
ranging from 8:00 a.m. to 5:00 p.m. During each session, a trained
observer focused on each individual for a period of 5 minutes and
recorded its behaviors. The data file recorded by this observer
contains 5377 interactions and is composed of seven columns, detailed
below:


DateTime: The timestamp of the interaction, i.e. the moment
at which an action was recorded;

Actor: The name of the baboon performing the action;

Recipient: The name of the baboon on whom the actor acts;

Behavior: The behavior of the actor. There are 15 different
types of behavior, including 'Rest', 'Play with', 'Grunt-chew',
'Beg', 'Threaten', 'Submit', 'Touch', 'Avoid' and 'Attack';

Category: The classification of the behavior, which can be
'Affiliative', 'Agonistic' or 'Other';

Duration: The duration, in seconds, of the observed behavior;
one-off contacts have no duration;

Point: Indicates whether the contact is a POINT event (YES) or a
STATUS event (NO).


We considered the same period as the one studied in the sensor data
set. Table 4 shows an extract of the data recorded by the human
observer on 19-06-2019 between 08:58 and 08:59 a.m., where we noticed
a transition from affiliative to agonistic behaviors. From the third
row of this table onwards, the behavior becomes agonistic between
individuals who had been interacting quietly a moment before (LOME
and FELIPE), and at 09:11 their interaction becomes affiliative again.
In the same data set, we noticed that these agonistic behaviors were
followed by attacks involving VIOLETTE, MUSE, HARLEM and MALI at
09:17 a.m.


DateTime          Actor     Recipient  Behavior     Category     Duration  POINT
19/06/2019 08:58  LOME      FELIPE     Resting      Affiliative  17        NO
19/06/2019 08:58  LOME      ANGELE     Resting      Affiliative  17        NO
19/06/2019 08:59  FELIPE    LOME       Submission   Agonistic    0         YES
19/06/2019 08:59  ANGELE    LOME       Submission   Agonistic    0         YES
19/06/2019 08:59  ANGELE    LOME       Attacking    Agonistic    0         YES
19/06/2019 09:17  VIOLETTE  HARLEM     Chasing      Agonistic    0         YES
19/06/2019 09:17  VIOLETTE  HARLEM     Submission   Agonistic    0         YES
19/06/2019 09:18  ANGELE    VIOLETTE   Threatening  Agonistic    0         YES


Table 4: Extract from the data recorded by the human observer,
in the morning of 19-06-2019.
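The change from affiliative to agonistic behavior visible in Table 4 could also be located programmatically. The sketch below is only an illustration of that idea, not the procedure used in the paper: it scans consecutive observation records and reports where the category switches, using the first five columns of the first four rows of Table 4; `category_transitions` is a hypothetical name.

```python
from datetime import datetime

# (DateTime, Actor, Recipient, Behavior, Category), copied from Table 4.
ROWS = [
    ("19/06/2019 08:58", "LOME", "FELIPE", "Resting", "Affiliative"),
    ("19/06/2019 08:58", "LOME", "ANGELE", "Resting", "Affiliative"),
    ("19/06/2019 08:59", "FELIPE", "LOME", "Submission", "Agonistic"),
    ("19/06/2019 08:59", "ANGELE", "LOME", "Submission", "Agonistic"),
]


def category_transitions(rows, before="Affiliative", after="Agonistic"):
    """Return (previous time, current time) for each before -> after switch."""
    # sorted() is stable, so records sharing a timestamp keep their file order.
    ordered = sorted(
        rows, key=lambda r: datetime.strptime(r[0], "%d/%m/%Y %H:%M")
    )
    return [
        (prev[0], curr[0])
        for prev, curr in zip(ordered, ordered[1:])
        if prev[4] == before and curr[4] == after
    ]


transitions = category_transitions(ROWS)
print(transitions)  # [('19/06/2019 08:58', '19/06/2019 08:59')]
```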


We then returned to the sensor data set for a finer analysis
of graphlets, in order to identify the baboons appearing in the weak
signal graphlets. The identified baboons are those who participated in
the aggression reported by the human observer one hour later.

To do this, we implemented Cypher queries in a Neo4j graph
database to select a particular graphlet instance and the nodes
belonging to it. Considering the sensor interactions collected
on 19-06-2019 at 08:00 a.m., figure 4 shows an extract of
particular instances of the weak signal graphlets (listed in table 3),
in which we find the individuals mentioned above, in positions that
are sometimes central (e.g. FELIPE in a) and at other times peripheral
(e.g. FELIPE in b).

Listing 1 is an example of a Cypher query that returns a $G_{27}$
instance, after specifying the labels of the nodes belonging to this
instance (because multiple nodes can occupy the same orbit in different
instances). In this instance, ARIELLE occupies the central orbit
$O_{69}$, and the remaining individuals, namely FELIPE, HARLEM, FANA
and VIOLETTE, occupy the peripheral orbit $O_{68}$.
Figure 4: Baboons in particular instances of weak signal
graphlets at 08:00 a.m.
// u1..u4 form a cycle and u5 is linked to each of them
MATCH (u1)--(u2)--(u5), (u2)--(u3)--(u5), (u3)--(u4)--(u5),
      (u4)--(u1)--(u5)
// exclude the two diagonals of the cycle
WHERE NOT ((u1)--(u3)) AND NOT ((u2)--(u4))
  AND u1.name = 'FANA' AND u2.name = 'VIOLETTE'
  AND u3.name = 'HARLEM' AND u4.name = 'FELIPE'
  AND u5.name = 'ARIELLE'
RETURN *


Listing 1: Example of a Cypher query that returns an instance 
of G27 
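The pattern matched by Listing 1 (a four-node cycle without diagonals, plus a fifth node linked to all four) can also be checked outside the database. The following brute-force Python sketch is an illustration only: it is not how Orca or Neo4j proceed, the edge list is a hypothetical example built from the names in Listing 1 rather than real sensor data, and `find_wheels` is a hypothetical name.

```python
from itertools import permutations


def find_wheels(edges):
    """Yield (rim, hub): rim is a chordless 4-cycle, hub touches every rim node."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = set()
    for hub, neighbours in adj.items():
        for u1, u2, u3, u4 in permutations(sorted(neighbours), 4):
            is_cycle = (u2 in adj[u1] and u3 in adj[u2]
                        and u4 in adj[u3] and u1 in adj[u4])
            no_diagonal = u3 not in adj[u1] and u4 not in adj[u2]
            key = (hub, frozenset((u1, u2, u3, u4)))
            if is_cycle and no_diagonal and key not in seen:
                seen.add(key)  # report each instance once, not once per rotation
                yield (u1, u2, u3, u4), hub


# Hypothetical links reproducing the instance described by Listing 1.
rim_links = [("FANA", "VIOLETTE"), ("VIOLETTE", "HARLEM"),
             ("HARLEM", "FELIPE"), ("FELIPE", "FANA")]
hub_links = [("ARIELLE", name)
             for name in ("FANA", "VIOLETTE", "HARLEM", "FELIPE")]

wheels = list(find_wheels(rim_links + hub_links))
print(wheels[0][1])  # ARIELLE
```

The search is exhaustive over the neighbours of each node, so it is only suitable for small snapshots.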


In this step, we allowed experts to make sense of the detected weak
signals by providing contextual elements that reveal the different
roles occupied by the nodes belonging to the orbits of these signals.
These elements enable them to focus their interpretation on particular
individuals that might be the cause of a future aggressive behavior,
as we discovered after comparing our results with the human
observations.


3.4 Study of community structure 


Identifying community structure in social networks is important
since it helps in understanding the topological interactions of
individuals in the network, as well as in discovering the information
they share. At night, the whole group of baboons gathers in the trees
to shelter from predators; during the day, it divides into small
groups. We therefore observed the distribution of interactions between
individuals at the level of the global graph, with all snapshots taken
together. We ran the Louvain algorithm [5] on the global graph of
the sensor corpus to reveal communities of nodes⁸, as illustrated in
figure 5, where the two communities are distinguished by red and green
colors, as shown in the side table of the figure. The community in red
contains most of the individuals of the group. This is to be expected,
because its members, notably FELIPE, LOME and HARLEM, participated
the most in the interactions (whether affiliative or agonistic). We
also found important links between individuals belonging to different
communities, such as FELIPE and ANGELE, or VIOLETTE and HARLEM.
Community detection alone is not sufficient to prevent attacks, since
attacks occur between individuals of the same community as well as
between individuals of different communities.
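The Louvain algorithm searches for a partition of the nodes that maximizes modularity. As a small illustration of that objective, and not of the Louvain procedure itself, the sketch below computes the modularity of a given partition for an unweighted, undirected graph; ignoring edge weights is an assumption, the toy graph is hypothetical, and `modularity` is a hypothetical name.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    with m the number of edges, L_c the number of edges inside c and
    d_c the sum of the degrees of the nodes of c.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for a, b in edges:
        degree_sum[community_of[a]] += 1
        degree_sum[community_of[b]] += 1
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single link.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(round(modularity(toy_edges, two_groups), 4))  # 0.3571
```

Putting all nodes in a single community gives a modularity of 0, so the two-triangle partition is preferred.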


⁸ Community structure is one of a network's characteristics, where nodes belong to
densely connected subgroups of this network.

Figure 5: Two communities detected by the Louvain algorithm.


In BEAM, we considered an exploration of weak signals via
quantitative and qualitative features. We applied a refined study of
specific node properties (for example their type, their position in
the graphlets, their community) to uncover their structural
relationships, whether or not they tend to group together, and how
much they contribute to the weak signals identified by the method.


4 IMPLEMENTATION DETAILS 


In this section, we discuss the detailed implementation of our
method, as well as the architecture that illustrates our proof of
concept for the identification and interpretation of weak signals.
We first present a study of the behavior of the graphlet counting
algorithm Orca on graphs of the sizes typically processed per
snapshot, which we used to measure the performance and the response
time of the algorithm.


4.1 Orca running time analysis 


There exist several algorithms to enumerate the graphlets and orbits
of a graph [27]. To choose the most suitable algorithm for counting
graphlets and orbits in the studied graph structures, we defined
three essential criteria: 1) exact counting of graphlets of up to five
nodes, to maintain the interpretability of the results; 2) orbit
counting, for the study of node positions within each graphlet;
and 3) availability of source code. We relied on the Orca algorithm
proposed by Hočevar and Demšar in 2014 [13], an exact counting
algorithm that comes from an analytic approach based on matrix
representation and works by setting up, for each node of the input
graph, a system of linear equations that relate different orbit
frequencies. Theoretically, Orca can operate on k-node graphlets
with $2 \leq k \leq 5$. On the one hand, the computation cost grows
dramatically as k increases. On the other hand, it is easier to
explore the graph when k is small. With e the number of edges and d
the maximum node degree, its time complexity is $O(ed)$ for
four-node graphlets and $O(ed^2)$ for five-node graphlets.
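To get a feeling for these bounds on a concrete snapshot, one can compute e, d and the two products directly from an edge list. The sketch below is a rough illustration: the products are only the dominant terms of the stated complexities, not running-time predictions, and `orca_cost_terms` is a hypothetical name.

```python
from collections import Counter


def orca_cost_terms(edges):
    """Return e, d and the products e*d and e*d^2 for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    e = len(edges)
    d = max(degree.values(), default=0)  # maximum node degree
    return {"e": e, "d": d, "four_node": e * d, "five_node": e * d ** 2}


# Toy star graph: one centre linked to four leaves.
star = [("centre", leaf) for leaf in ("l1", "l2", "l3", "l4")]
print(orca_cost_terms(star))
# {'e': 4, 'd': 4, 'four_node': 16, 'five_node': 64}
```

The quadratic dependence on d explains why dense snapshots, where a few nodes have a high degree, dominate the running time for five-node graphlets.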

We performed an experimental analysis to evaluate the complexity of
Orca's implementation. The experiment covers two settings:
1) fixing the number of nodes and increasing the number of
links; and 2) fixing the number of links and increasing the number
of nodes accordingly.

In the first setting, we generated random graphs with a small, fixed
number of nodes equal to 200, and an increasing number of links,
starting at 50% of the number of nodes. Figure 6 shows the
response time of Orca as a function of the measured density of each
of the considered graphs. In this figure we notice that there are
thresholds. As long as the density of the graph is lower than 0.4, the
time consumption is less than 10 seconds. For a density between
0.4 and 0.6, the response time increases from 10 to 40 seconds (that
is, the time is multiplied by four). Beyond a density of 0.85, the
response time is almost multiplied by two, then remains stable.
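The first setting can be reproduced in outline with a small generator of random graphs. This is a sketch under assumptions: the paper only says that the graphs are random, so drawing the links uniformly without repetition is a choice made here, the graphs are treated as undirected and simple, timing Orca itself is left out, and `random_graph` and `density` are hypothetical names.

```python
import random
from itertools import combinations


def random_graph(n_nodes, n_links, seed=None):
    """Draw n_links distinct undirected links uniformly among n_nodes nodes."""
    rng = random.Random(seed)
    all_pairs = list(combinations(range(n_nodes), 2))
    return rng.sample(all_pairs, n_links)


def density(n_nodes, n_links):
    """Density of an undirected simple graph: 2e / (n (n - 1))."""
    return 2 * n_links / (n_nodes * (n_nodes - 1))


N_NODES = 200
# Start at 50% of the number of nodes, then grow towards denser graphs.
for n_links in (100, 1000, 7960):
    graph = random_graph(N_NODES, n_links, seed=42)
    print(n_links, round(density(N_NODES, n_links), 3))
```

With 200 nodes there are 19,900 possible links, so the density threshold of 0.4 mentioned above corresponds to 7,960 links.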
Links   Nodes   Density   Elapsed time (in milliseconds)
5,000   500     0.04      118
5,000   600     0.027     85
5,000   1,500   0.004     23
Figure 6: Orca's response time for 200-node graphs of
increasing density.


As for the second setting, we designed an experimental set to
measure the performance of the algorithm while fixing the number
of links in the graphs and varying the number of nodes up to
80% of the number of links. After applying BEAM to several data
sets of different sizes, we found that a typical graph for estimating
the behavior of Orca has roughly 5,000 links on average. We therefore
designed the experiment with the number of links fixed to
2,500, 5,000, 10,000 and 100,000 respectively. Table 5 contains an
extract of the properties of the graphs used in the experiment,
with links fixed to 5,000 and 10,000 respectively, along with the
corresponding elapsed time of Orca in milliseconds. We highlighted
the highest values of Orca's response time in red and the lowest
values in blue. This extract confirms the theoretical behavior
of the algorithm: the response time increases with the graph
density. Note also that the algorithm processes the data of a
snapshot on real graphs within reasonable times.
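As a consistency check, the densities reported in Table 5 can be related to the numbers of nodes and links. The formula below is the standard density of an undirected simple graph; treating the generated graphs as undirected, and taking $e = 5000$ links for the three rows of the extract, are assumptions made for this illustration.

```latex
D(e, n) = \frac{2e}{n(n-1)}, \qquad
D(5000, 500) = \frac{10000}{500 \cdot 499} \approx 0.0401, \quad
D(5000, 600) = \frac{10000}{600 \cdot 599} \approx 0.0278, \quad
D(5000, 1500) = \frac{10000}{1500 \cdot 1499} \approx 0.0044 .
```

Up to truncation, these values agree with the densities 0.04, 0.027 and 0.004 listed in the extract, and the elapsed time decreases with them.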


4.2 Architecture description 


To meet the needs in terms of performance, interoperability (use of
third-party tools to apply complementary analyses that help end-users
interpret graphlets) and adequacy between the data structures
handled by the different algorithms described above, we have
specified an architecture with three layers. These layers correspond
to: 1) the data storage; 2) the detection; and 3) the interpretation
of weak signals. They communicate through third-party tools.
Figure 7 illustrates the architecture specified for our approach.

Table 5: An extract of the experiment carried out to measure
Orca's behavior for a list of fixed numbers of links.


The storage layer contains the raw data collected from the various
sources, as well as the data resulting from the two other layers.
This layer supports different types of source data files, such as CSV,
TXT or JSON files (for example, tweets are downloaded from the
Twitter API in the form of JSON files⁹). The data resulting from the
other two layers can be stored in two types of database management
systems. The first is relational, for which we use PostgreSQL, and
the second is graph-based, for which we use Neo4j.

The detection layer consists of three steps: 1) raw data processing;
2) candidate selection; and 3) weak signal qualification. This layer
makes use of third-party tools, including the R igraph graph library
and the Orca algorithm for graphlet counting. It sends the detected
weak signals to the upper layer to be interpreted.

The interpretation layer contains all the components that help to
make sense of the signals detected in the lower layer, and hence help
an expert to determine their relevance for future planning. These
components first include the identification of node positions within
weak signal graphlets, thanks to their orbits. These positions are
provided by snapshot, by node and by orbit (they represent central,
intermediary and peripheral positions). The components also include
centrality and community measures applied at the node level. In
addition, this layer includes visualization methods that give
experts insights into, for example, the distribution of the most
important nodes in weak signal graphlets. They consist of (but are not
limited to) Sankey diagrams, histograms, pie charts and heatmaps.
These visualizations can be given at the level of an individual
snapshot, and for all snapshots combined. Finally, this layer offers
semantic analysis through an examination of the characteristics of
the nodes, their activity and their role in the society.

These three layers communicate via reproducible workflows that
we implemented using Jupyter notebooks [24]. Jupyter notebook
is a tool used for exploratory data analysis. It allows data
scientists to create scripts combining code, text and graphical
interfaces. Specific kernels for different programming languages,
including Python, R, and Scala, run independently and interact with
Jupyter. We consider the Jupyter notebook as an interactive
environment that allows us to gather a description of the input data,
develop with multiple programming languages in a single kernel, and
then save and convert the results to formats other than structured
JSON files,


⁹ https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview

Figure 7: Architecture of the platform consisting of three layers.

Figure 8: Jupyter notebook of the case study realized on baboons interactions.


such as HTML and PDF. Figure 8 shows a capture of the Jupyter
notebook editor for the case study described in this paper,
baboons interactions. On the top right of this figure, the type of the
kernel used to run the written code is displayed. We use the R kernel
to execute our scripts. On the left there is a navigation panel that
shows the sections that compose this notebook. The content of
section (5), highlighted in yellow in this panel, is shown in the
center cell of the editor. The scripts are implemented using R
libraries and functions, to create the series of snapshots (in the
form of temporal graphs) from the baboons interaction data. Once the
execution of the R scripts in this cell is finished, the execution
end time is automatically displayed under the cell. Below the center
cell is another one showing the resulting graph of a particular
snapshot. Here the number of the snapshot is 13, so the result shows
the list of nodes interacting in this snapshot.

Apart from the reproducibility that Jupyter notebooks offered
to our approach, their interactive platform eased the exploratory
data analysis. It also provided us with a way to craft a story with
the data processed in the different layers of this architecture.


5 DISCUSSION AND CONCLUSION 


We presented in this paper our approach to identifying weak
signals in social networks, which uses graphlets as an operational
description. We first find graphlets in a temporal interaction graph
and quantify their diffusion and amplification, which characterize
them as candidates. Then, we measure the contribution of
these graphlets to qualify the true positives and true negatives and
to identify the false alarms. In an additional step, we use the
predefined shapes and orbits of the graphlets to help
experts in interpreting the discovered signals.

We applied our approach to a ground truth data set representing
social interactions within a group of Guinea baboons, and we
were able to detect weak signals. By comparing our results with
those recorded by a human observer of the baboons interactions, we
confirmed that the detected signals appear prior to an aggressive
behavior that was reported one hour later by the human observer.
These results must be shared with the stakeholders and the persons
in charge, to provide better control over such behavior, and perhaps
to build a preventive strategy in the future. Data and experimental
programs are available at
https://github.com/hibaaboujamra/Weak-Signals-Detection-and-Interpretation-BEAM.
Our method has also been validated on Twitter data sets [15, 16].

However, our approach still has a few limitations related to
the filters introduced by Ansoff (detailed in section 2), which
hinder the analysis of weak signals and constitute barriers to their
interpretation. These limitations are mainly linked to:


(1) the constitution of the study corpus (monitoring filter); 
(2) the interpretation of the detected signals through their recog- 
nition (mentality and power filters). 


The first limitation can be addressed by adding a feedback loop
that allows business experts to modify the data selection filters if
they do not find the discovered signals relevant. The second one
depends highly on the expert having sufficient knowledge about the
utility of the discovered signals.

We are currently addressing these limitations of our method, mainly
to provide control over the data selection filters and to cover
different detection scenarios. In further research, we would like to
enlarge our scope and explore the use of clustering techniques for
the discovery of groups and profiles of nodes in the studied temporal
graphs.
ABSTRACT 


This paper offers a brief history of the information age in order 
to demonstrate how the loss of user control and the increase in 
certain forms of automation have metastasized into imminent and 
ongoing threats to social order and the democratic way of life. 

The internet was established after a number of developments,
which included the interconnection of computers without the need for
extensive action by the users. It led to the introduction of user
communication sub-systems such as text, email, file sharing and
systems for searching for files. The so-called information age is
said to be marked by the adoption of the Hypertext Transfer Protocol
in the last decade of the twentieth century. It was also marked by a
number of meetings, which included the first World Wide Web
conference in April 1994, followed in quick succession by the
second (Oct. 1994) and the third (April 1995). Other, by-invitation-only
meetings which dealt with issues of this era were held in Denver, OH
(Metadata) and Bethesda, MD (America in the Age of Information).

However, in just under three decades this information age has
metastasized into a form that is a threat to our social order and
democratic way of life while fostering division. Enormous wealth
has been garnered by just a few corporations and individuals at
the cost of the harm done to people all over the globe.
This is the result of the spreading of fake news and the favouring of
angry content, which result in civil strife and loss of lives. It has
led to divisiveness and autocratic governments. Some so-called
democracies are democracies in name only, with the same people
continuing in their ‘elected’ position from term to term, ad infinitum.
Just as in the metastasis of a cancer, until it is checked, this
transformed internet will destroy some vital parts of our everyday
existence: our privacy and liberty, while promoting an inegalitarian
spirit.
1 INTRODUCTION 


Humans have evolved over the millennia; however, it is only
in the last few centuries that history has been recorded in a form
that could be easily reproduced and transmitted from generation
to generation. This was made possible by the invention of the
printing press, widely considered to be a revolution. Humans have
gone through a number of revolutions: roughly, the replacement of
one type of social order by another.

Most revolutions are periods in a nation when one system is
replaced with another: the aim is to replace an autocratic
government with something more representative [6]. Some are
successful, others not so. The revolution in the British colonies of
North America led to a democratic federal system with a constitution
whose interpretation and rigidity have created, and continue to
create, problems. The Chinese, French and Russian revolutions replaced
one tyranny with another after wars, suffering, and tens of thousands
of fatalities.

Here we are not discussing these revolutions but those which 
have been called industrial revolutions through multiple genera- 
tions of digital computing devices, and development of the internet 
from connecting a number of computers with a small number of 
users to connecting millions of computers and billions of mobile 
devices as clients. We briefly talk about the first two and then turn 
to the subject of this paper namely the metastasis of the internet. 


2 THE INDUSTRIAL REVOLUTIONS 


It took thousands of years before the mechanical advantages offered
by simple machines were combined ingeniously into more complex
machines. The power source to drive these machines was either
human or animal. Other sources, such as wind, were also harnessed
over the years. However, a more reliable one was needed. Boiling
water to demonstrate the properties of steam dates back over two
thousand years, and early versions of the steam engine were
invented in the sixteenth century. Steam became the driving force of
the first industrial revolution, said to have lasted from 1760 through
1850. The topic of the industrial revolutions is a vast one and is
covered extensively in many sources [3]. The first industrial
revolution took place in an era when industrial ownership and its
products created a class of wealthy entrepreneurs. It also meant a
shift from rural to urban environments. Working in badly configured
factories was a gruelling experience. Workers were not allowed free
time and worked long hours; the work force included children and the
pay was dismal.

During this period radical changes impacted not only agricul- 
ture but transportation and the social structure. The use of steam 
engines in mills and factories and subsequently in railways was 
an achievement of this period. The industrial revolution created 
a new source of great wealth for a handful of entrepreneurs, the 
owners of the factories while exploiting the workers including child 
labour. The industrial revolution led to an exodus of workers from 
agricultural, rural settings to city slums with health issues[19]. 

In the first industrial revolution the source of energy was coal.
Some historians quibble over the exact boundary between the first
and the second industrial revolutions; the second started around the
mid-19th century. A primary difference is that the second marked the
beginning of mass production in manufacturing and consumer goods.
The power sources for the second industrial revolution were oil and
gas, and the internal combustion engine. One can also mention the
third industrial revolution, wherein the power source was electrical
and subsequently nuclear energy, with electric motors and assembly
lines in automated factories. It has been followed by a fourth
revolution, in which digital communication technology and the
internet changed how information and interaction are managed.


3 THE DIGITAL GENERATIONS 


The idea of a programmable computer is often traced to Charles 
Babbage[13]; he was a mathematician, philosopher, inventor and 
mechanical engineer. (In a recent book[77], one author seems to 
give this honour to the Majorcan polymath Ramon Llull who de- 
signed a machine made of paper.) Humans had to wait another 
century before the vision of Babbage was realized, in the form of 
digital computers, first using electro-mechanical components and 
subsequently all electronic components. 

At the centre of today’s computing and communication technology,
including the internet and mobile phones, is the need for
digital devices and infrastructure. The first generation of digital
computer systems, which replaced analog systems, was built from
vacuum tubes. These machines were large and required a
considerable amount of electrical energy. They were relatively slow
and had very limited storage capacity.

The next generation of digital computing devices made use of
solid state devices using diodes and transistors. Each such component
was discrete; the advantages were lower power requirements and
smaller size.

The third generation of digital systems used integrated circuits
and hence there was a considerable reduction in size and power needs.
Time sharing and remote access, as illustrated in Figure 1, became
possible. Initially, the communication system was an analog acoustic
coupler and the input/output device was an updated teletype. A higher
level of circuit integration allowed more powerful digital computers
and developments in operating systems, with features including
multi-tasking and multiprogramming. These features allowed many users
to time-share a central computer over a dedicated high-speed
telecom line.

The emergence of large scale integration and its refinement
and miniaturization led to mini and microcomputers, personal
computers, laptops, notebooks, tablets and smartphones. These could
be called the fourth generation of digital computers or, more
broadly, of digital devices. There was speculation about a fifth
generation of computers; however, with the advent of cloud computing
and a few supercomputers, the fifth generation has more or less
dropped out of sight. Ironically, cloud computing returns to the
earlier generation as a form of time-sharing++!


4 THE INTERNET AND WORLD WIDE WEB 


As more powerful computers were introduced in the early 1960s,
organizations including universities set up central computer
centres to house powerful systems. A user who wanted to execute
her program was required to prepare the program using punched
cards and bring the deck of these cards to the centre. Common
programming languages used in the early 1960s were FORTRAN
and COBOL. The program, along with any data, would be in a deck
of cards which would be submitted to run in batches of programs
written in the same language. Once the submitted program for a
user had run, its output would be printed and both the output and
the deck of cards would be picked up by the user. The procedure,
if required, needed to be repeated for any changes to the program
or data!

Later, with improvements in operating systems and the introduction
of telecommunication facilities and the acoustic coupler, it
was possible to use a remote terminal to input the program and run
it. This was feasible for small programs, but larger programs
needed the previously mentioned manual procedure. However, satellite
stations with a mini-computer could be set up to avoid trips to
the computer centre. These satellite stations were connected to the
central computer with a high-speed, dedicated telecommunication
line: this was the advent of connecting computers! An example of
this was the use of the central computers located in a downtown
campus by users in a suburban campus¹.

Systems such as time-sharing in the early 1960s allowed many users
to log into a central system from remote terminals, and to store and
share files on the central disk. Computer-based messaging between
users of the same system also became feasible.

The introduction of third-generation computers in the mid-1960s
and the use of time-sharing prompted MIT professor Martin
Greenberger to write the following, in May 1964, in an article in
The Atlantic[42]:

“Computing services and establishments will begin to spread 
throughout every sector of American life, reaching into homes, 
offices, classrooms, laboratories, factories, and businesses of all 
kinds.” 

The earliest form of email was introduced on a Unix system in
the early 1970s: this allowed users to compose a message and send
it to the mailbox of other users on the system. With the
interconnection of computer systems over early networks such as
ARPANET, there were, by the early 1980s, a number of dedicated
networks of interconnected computers in the world; this lasted until
the adoption of packet-based communication (TCP/IP) of digital
information, standardized in 1982. This led to the emergence of a
world-wide network of fully interconnected networks all using the
Internet Protocol (IP). Initially, the internet connected systems at
a number of organizations (including universities) and was accessed
by people at those institutes to communicate and share[91]. The
Simple Mail Transfer Protocol (SMTP) was introduced[90] to send mail
from a user on one computer to a user on a remote computer.


¹The students and faculty members of Loyola College used the central computer
located at the downtown McGill University in the late 1960s and early 1970s.


Figure 1: Remote computing


The author using an acoustic coupler to communicate with a
remote mainframe, early 1970s

The web came into existence in the late 1980s with the development
of the Hypertext Transfer Protocol (HTTP)[94], an application
of the TCP/IP protocol, and the first text browsers supporting
the early Hypertext Markup Language (HTML)[93] standard. With
the introduction of the graphical browser in the early 1990s, data
sharing was for the first time extended to the masses[94].

As noted in [20] “even before the introduction of the web, the in- 
ternet had made it possible for people to communicate via electronic 
mail (email) [69],[88] and on-line chat, allowed sharing of files[87] 
using anonymous file transfer protocol (FTP), news (Usenet News), 
remote access of computers (telnet), Gopher (a tool for accessing 
internet resources), Archie (a search engine for openly accessible
internet files) and Veronica (search for gopher sites). These early
systems afforded the opportunity of interconnecting people (who 
wanted to be connected), sharing resources without requiring any- 
thing in return and providing security and privacy; there was not 
yet any question of monetizing; the whole concept was to share 
without exploitation or expropriation of user data or content. How- 
ever, these systems were not adopted widely: a key limitation of 
these early internet tools was the need to have some computing 
savvy; another challenge was the lack of an infrastructure to trans- 
fer the know-how to novices. This was also a limitation for the 
early web with the use of user unfriendly, text-based web browsers 
and a lack of training facility and easy to learn tools to build and 
maintain hypertext documents.”


Figure 2: WWW I Navigation Workshop


The author and colleagues during the WWW I Navigation
workshop in 1994; in this forum, the author put forward the ideas of
web history and search engines.


Some early attempts to create software for hypertext[98] were 
buried by the emergence of the early tech giants who were more 
interested in having their system dominate the internet and limiting 
users from learning the basics. This strategy of dumbing down is 
behind all current systems, which has contributed to a downgrading 
of literacy and replacement of reading by videos and sound clips. 


4.1 Web of Big-techs 


The web was quickly recognized by business interests as an oppor- 
tunity for commercial exploitation and this led to an explosion in 
the creation of data. The first few meetings of the World Wide Web 
conferences, for example the ones in Geneva (May 1994), Chicago 
(Oct. 1994) and Darmstadt (Apr. 1995), were oversubscribed mainly 
due to participants from business. The web provided new avenues 
for research not only for people in computer science but also in 
all areas of human learning. It has changed the way we do every- 
thing! Using simple words even a naive web user can find and 
subsequently access a large repository of web pages through the 
intermediary of a number of search engines. It is worthwhile to
note at this point that the early search systems developed by the
pioneers of the web have all but disappeared, replaced by late
arrivals. The web, one of the services of the Internet, made it
possible to realize, in less than half a century, the vision that
Vannevar Bush wrote about in 1945[83]!

Marshall McLuhan[65] noted that “The medium is the message”
in relation to new media, namely radio and television, introduced
in the early and mid-twentieth century. AM broadcasting was
established in the 1920s and FM broadcasting in the 1940s. TV
broadcasting started in a small way in the 1940s. With the advent of
the web and its appendages, search engines and OSNs (on-line social
networks), and the popularity of the mobile phone and its integration
of the internet and web, one wonders what the characteristics of this
medium are before looking at its contents. Most of these OSNs want to
be THE internet and try to entice their users to be glued to them and
never need to use anything but their system.

For a vast majority of users of the internet, their principal or
exclusive access point is a small screen with limited user interaction.
Most web browsers meant for this restricted medium have very
few user controls. With a limited visio-keyboard the interaction is
awkward. The traditional menu at the top of a browser display is
no longer a default and for some browsers it is impossible to access,
even in the more robust desktop versions of the applications.

Another major application of the web has been the introduction
of OSNs and other “platforms” that allow anyone to share personal
information and news and to express their opinion on any topic².
Users seem to have no second thoughts in posting any type of
personal information. Their lack of sophistication in assigning
the correct privacy setting to these postings means that their
personal information may be accessed by anyone on the system
and of course by the OSN that hosts this application. The use of
weak or “easy to guess” passwords does not help. The privacy issue
has been of secondary importance for many of these OSN operators.
These operators, through their terms of service³, effectively take
over perpetual ownership of all information and use it for commercial
purposes and/or sell it to third parties. A privacy bill, recently
proposed in the US Congress, would offer little help to individuals
while giving companies great leeway in determining how they
collect, use and share personal data[58].


4.2 Web and Artificial Intelligence 


A proposal was made by McCarthy et al. in August 1955[62] to set
up a 20 man-month study in the summer of 1956 with the following 
goal: 

“The study is to proceed on the basis of the conjecture that ev- 
ery aspect of learning or any other feature of intelligence can in 
principle be so precisely described that a machine can be made 
to simulate it. An attempt will be made to find how to make ma- 
chines use language, form abstractions and concepts, solve kinds of 
problems now reserved for humans, and improve themselves.” [62] 


²These users have only to become familiar with the OSN interface and can stay ignorant
of the mechanisms used, never mind starting a text editor program such as EMACS and
typing out a simple, compact web page.

³The terms of service that the user must have agreed to at the time of signing up
make sure that the OSN is protecting itself and is able to mine and exploit the
user’s data.

“It is generally agreed that this was the birth of AI. Recall that
at that time the first generation of digital computers were around 
for less than a decade and were mammoths (in size and weight) 
with very little memory. More recently, a century long study of AI, 
to be hosted at Stanford University was published[80]. Over the 
past seventy-five years, there has been astronomical increases in 
computing power and storage capacity along with miniaturization; 
the progress in many aspects of computer science has also been 
remarkable. The computing power for a given volume and weight 
has increased by many powers of magnitude. This has enabled more 
complex algorithms, data storage and analysis to be possible. With 
the capitalization of the internet and the enormous potential for 
venture capitalists, the problem of funding was solved. This allowed 
the application and adaption of many approaches including half- 
baked concepts to be realized. The players included new companies 
supported to extend the commercialization of computing and the 
internet to new applications and the replacement of established 
ways of doing things by new ones. Many of these have destroyed 
or are destroying jobs and ways of life and have created tons of 
discarded devices replaced by new ones with different bells and 
whistles. Very little intelligence, artificial or natural, is required to 
see the adverse environmental impact of this madness.”[20, 21, 28] 

Artificial intelligence in the last two decades since the advent 
of the web has focused on the development of intelligent agents 
using reasoning based on statistics collected from a learning data- 
base. With the availability of extensive databases and computing 
capabilities, impressive progress has been made in speech recog- 
nition, image classification, machine translation, locomotion, and 
query-response systems. Most of these are already on-line and used 
by millions. One popular application is the live driving directions 
used by many drivers and which has made obsolete the good old 
map and preplanning of trips. It should be noted that the directions
given are often not the most efficient or least polluting ones:
even though they may take the least time, the directions are not
necessarily the shortest or the most environmentally friendly.

Most machine learning algorithms learn using programmed
rules, whether the model be a simple one or a more complex neural
network. Usually, the learning algorithm and the programs for it take
into account
all the possible connections in the sample data and additional data 
that is generated by the learning. The problem is that there is al- 
ways the possibility of missing some connections and of course the 
bugs introduced in the programming. Hackers have used the bugs 
and trap doors to create spyware. 

Recently a tech-giant fired an engineer who claimed that the
software system LaMDA (Language Model for Dialogue Applications)
is sentient. This was his conclusion after testing the system:
according to the engineer, the system displayed signs of experiencing
sensation or feeling[45, 61]. According to the tech-giant, the system
is designed to generate convincing human language. One wonders
about its use in customer support to give clients the impression
that they are talking to a human, and hence to need fewer support
staff.

The concept of consciousness (intelligence) in a machine was
conjectured by Samuel Butler in his book Erewhon⁴[12].


⁴Published in 1872 and digitized in 2005 by Project Gutenberg.

“There is no security against the ultimate development of
mechanical consciousness, in the fact of machines possessing little
consciousness now. A mollusc has not much consciousness. Reflect 
upon the extraordinary advance which machines have made during 
the last few hundred years, and note how slowly the animal and 
vegetable kingdoms are advancing. The more highly organized 
machines are creatures not so much of yesterday, as of the last five 
minutes, so to speak, in comparison with past time. Assume for 
the sake of argument that conscious beings have existed for some 
twenty million years: see what strides machines have made in the 
last thousand! May not the world last twenty million years longer? 
If so, what will they not in the end become? Is it not safer to nip 
the mischief in the bud and to forbid them further progress?” 

However, this progress is going on. Businesses such as banks,
utilities etc. are offloading most of the billing and payment
operations to the internet or its surrogate, the mobile phone. They
are saving all the mailing costs etc., and the cost of the
internet/mobile device, plans and bandwidth is to be borne by the
customers. None of the savings is passed to the customer. It is no
wonder that these businesses are looking for a sentient system to
replace whatever customer support they are providing, and the
big-tech that arrives there first is going to reap the benefit.
Companies are increasingly turning to chat-bots to interact with
customers. The author had a bad experience with these bots and, not
being satisfied, tried without success all means to get the system to
bring in a human. Companies are trying to make these bots more
‘human’. Is this what is behind the recent story about a bot being,
according to one engineer, sentient? These companies are developing
better bots to interact with customers and hence cut their
expenses[9].

A system such as LaMDA, if it could interact like a human⁵
with customers, would be a boon. Customer service provided by
real employees, face-to-face, has been replaced by phone calls⁶; the
first is disappearing and even the second is being replaced by a
repeated message to send the organization an email which would
be answered in 24 hours. All this transition is to save money and not
to hire telephone receptionists.

Looking at the case of LaMDA, one wonders if the intent is to
provide human-like robot service to customers. Victor Frankenstein
created a human beast which started learning from observations and
became sentient[78]. The creature asked Frankenstein to create a
companion, which Victor resisted and for which he paid with his own
life. It is likely that LaMDA and its clones will continue their
evolution.

The big tech companies behind these bots could replicate these 
systems and could easily adapt the bots for other applications. From 
what one observes, it does not seem they, in the pursuit of profit and 
market domination, have the same reluctance as Victor Franken- 
stein who did not create a companion for the fiend he created[78]! 
It is likely that the internet will continue to metastasize into
bot-ruled systems that will mimic human nature. Companies would
outsource customer service to big-techs instead of to developing
countries, since it would likely be cheaper and garner another big
bonus for the CEOs and VPs of IT!


⁵Is it not the objective of big-tech?
⁶Which ends up with the caller being put on hold and subjected to inane messages.

Once bots take over these jobs handled by humans, there would
be more profit for the business and fewer positions for the
increasing population, which is to reach eight billion souls by
mid-November 2022. Anders [3] had anticipated this situation in an
“overmanned” world, wondering what will become of these billions of
superfluous humans.

However, looking at the number of bugs in most systems and the 
recent incident in Canada when an entire communication network 
that serves millions of users went down during an update which 
likely had bugs[57], one would have to expect disasters like this
with a global footprint!


5 DIGITAL ECONOMY HUBRIS AND GREED IN CORRUPT SYSTEMS


As mentioned above, developments in the internet and connectivity
have led organizations to abandon their in-house computer and IT
services in favour of having all their data hosted on the cloud, with
IT supplied by software houses. The Government of Canada made the
move to a system called Phoenix and has suffered the consequences.
Universities have now taken up this initiative under the impression
that there would be savings. The author’s experiences with these
systems have not been positive.

The chances of hacking of such systems are enormous, with all the
built-in bugs and back-doors in all software. A recent example of this,
reported in [79], occurred in a student-tracking software system
and affected the confidential information of more than a million
current and former school children. It appears that safeguards are
not in place: one of the first things that should be considered when
selecting a system seems to be missing.

One wonders if slick marketing techniques had been used to
sell such systems to eager VPs of IT who wanted to take credit for
the ‘apparent’ savings. We know how the tobacco industry
hooked millions of people on tobacco, a difficult dependency to
overcome[37]. The opioid crisis is another example: pharmaceutical
companies marketed directly to medical doctors, the prescribers of
drugs, and so got an addictive pain relief drug into the bodies of
suffering patients, hooking them on it[51]: yet another
example of ‘break things’. When the patients start dying, the same
marketing and/or management consultants step in to repair their
image and try to convert the evil into an opportunity. Behind the
scenes, marketing and consulting organizations have guided the
opioid crisis[48]. The investigation of tens of thousands of
documents illustrates the working of consultants for opioid makers.
Such firms become trusted advisers to companies manufacturing
and aggressively marketing opioids, which are considered to be the
cause of the loss of hundreds of thousands of lives. Such management
consultants, like those for big tech, helped big pharma to develop a
strategy for dealing with the regulatory bureaucracy to seek approval
for products.

European Union authorities have been urged to investigate a
former politician linked to Uber and consider stripping the
cab-hailing company of access passes to the European parliament, amid
growing calls to rein in tech lobbyists[74]. The demand for an
EU inquiry comes as some politicians consider tighter rules on
lobbying after the publication of the Uber files, a trove of data
leaked to the Guardian and shared with media in 29 countries via
the International Consortium of Investigative Journalists.

Big-tech and many other businesses, using lobbying, influence
peddling and the misplaced belief that the high-level leaders of
innovation must not be stifled with regulations, have been able
to ignore existing regulations, laws and practices. Their modus
operandi seems to be to break things (ignore everything, and then
fix the things they have broken to their own advantage). This includes
the time-honoured copyright laws etc.⁷ Some of the machinations
used by some companies have been revealed by what are being called
the Facebook papers and the Uber papers.

The emergence of the web as an application of the internet
ushered in the fourth (or the fifth) industrial revolution. As in
the very first revolution, it changed many aspects of life as new
corporations were set up using venture capital, and they were able
to ignore all tradition, rules and regulations, using lobbying and
bent politicians to get the regulations etc. changed!

The philosophy used is that of ignoring all norms, traditions,
regulations and laws. By getting bent politicians and their aides
on-side they either have these not applied to them or have the
in-place regulations and laws changed. The many bent politicians⁸
would oblige these big-tech companies! Some of them, when
interviewed by the media about the Uber papers, seem to be proud of
their accomplishment in this connection. They and these new tech
companies can use their fortunes to hire lawyers and use the courts
to defer legal recourse. A case in point is that of a Canadian
woman, Deborah Douez, who has been battling one of the big tech
companies for a period now approaching a decade[14]. The bent
politicians also use this very tool to raise millions of dollars from
ignorant people who, having nothing better to do than follow people
like them and be taken in by their lies, end up sending in
contributions to whatever cause. Recently one of these politicians is
reported to have raised over 250,000,000 USD. The funds are augmented
by other billionaires, some of whom have been using the internet to
mint a fortune.

An organization which was supposed to promote the hyperme- 
dia protocol was taken over by business. One is appalled by the 
sophisticated tracking incorporated in the browsers and the ap- 
plications for mobile devices, all of which have access to the data 
on the phone instead of requesting data if and when required and 
getting the permission of the user for any bit of information. The 
design of these systems allows these breaches in the user’s privacy. 

The Digital Millennium Copyright Act (DMCA), signed into law 
in 1998, provided complete immunity to internet service providers 
and platforms from copyright claims when their users upload or 
share copyrighted material to the platform. Thus the law clears
the platform from immediate liability and it is likely that the
material would stay for a considerable length of time on the platform[52].

The big tech companies [59] have big purses to hire lobbyists,
finance politicians’ election campaigns, and use lawyers to delay and
fight each and every one[10]. Their only concern is to cannibalize and
monopolize and at the same time colonize, using the US government
to shield them and lobby foreign governments[85]. For this they
either put any competitor out of business or buy up any start-up and
competition[95–97, 99]. It is likely that many of the people at the
acquired company may become redundant! The successive US
governments have not blocked any such buyouts, as seen in
the long list given on the Wikipedia pages mentioned above.
There seems to be easy access to decision-makers at all levels of
government by big techs: this access has been used to influence
the decision-makers, and most of them are happy to be photographed
with the CEOs of these big-techs. These corrupt leaders believe
that these tech giants were providing growth and innovation while
in reality they were stifling competition and destroying existing
infrastructure and the competition which provides a choice to
consumers.


⁷Perhaps developing countries should ignore all patents, and copy algorithms while
improving them!

⁸They used these OSNs to propel themselves to their jobs and want to continue to keep
climbing!

The US system has abandoned a global tax act, which was aimed
at cracking down on companies evading taxes by shifting jobs and
profits around the world, and is failing to raise tax
rates on these multinational corporations[72]. For example, some
of these big-tech companies book profits made in one country as being
made in another to minimize their exposure to corporate taxation.
Most taxation agencies aggressively go after individuals and local
small businesses, while ignoring major big tech companies[49].

The business models used by many Internet-based companies follow
an approach that “relies on privatizing profit and socializing risks”[18]
and on exploitation of one kind or another. They are led by ruthless
people who seem to have an unquenchable appetite not only to
monopolize their segment of the business but also to look for
opportunities to expand their horizons. There are many self-serving
politicians, policy makers and their aides who see personal gain.
They use their connections and prestige to continue to influence the
government even after their term of office ends. One needs only to
look at some of the opulent properties some of the ex-leaders have
acquired!

Not long ago, we may have believed that technology would en- 
hance personal freedom and democratic choice. It looked to be so for 
a while! However, technology is starting to shift the global balance 
toward monopolies and autocratic regimes. Furthermore, conflicts
between democracies and autocracies have already started[4]. As
reported by Amnesty International, the business model of the OSNs 
is threatening human rights[5]. 


6 EXPLOITATION OF THE INFORMATION AGE 


The introduction of the web made the internet accessible to many more
people, along with the rapid uptake of mobile phones integrated
seamlessly with the internet. This was also the start of the spread of
misinformation! The misinformation is such that it triggers emotions
in the recipient, who in turn propagates it, directly
or automatically, thanks to the algorithms used by the OSN!
Instead of the medium being the message, the aroused users become
the messengers to others and hence reinforce the misinformation.

As pointed out by Günther Anders[3], as in the previous
industrial revolutions, big tech companies of the information age
consistently oppose unionization efforts and use the old technique
of finding jurisdictions with lax regulations more favourable to
them, looser workplace requirements, and almost no consequences for
breaking labour laws[44, 75].

The workings of these big tech companies are coming to light
thanks to thousands of papers leaked by insiders from some of these
digital economy based conglomerates: it is expected that more will be
forthcoming in the future for others of these companies in the digital
economy. What we find is that the OSNs and the big-techs consider
the user data, actions etc. as a mine to be appropriated and
exploited so as to bring to light anything that is concealed or
connected[3].

The trove of documents released by an ex-employee of Facebook
reveals, among other things, the role of the company in the Jan. 6
insurrection in Washington, D.C. and the effect of the company
around the world. While privately and meticulously tracking the
hate and divisiveness magnified by its OSN platform, the company has
not heeded warnings from its engineers about the dangers posed by the
design decisions made for its algorithms with the goal of having
users stay riveted to and interacting with the site; the OSN chooses
growth through maximum engagement over user safety. The public
claims made by this OSN often conflict with internal research. One
of these is the claim of removing 95 percent of hate speech when
in reality it is only 5 percent (or did they get the figures mixed
up?)[8, 55].

The OSN’s problems with hate speech and misinformation are 
dramatically worse in the developing world. Due to weaker
moderation in many countries, OSNs allow their platforms to be used by
maleficent actors and authoritarian regimes to propagate hateful
and divisive misinformation. The head of Facebook, who controls
the majority of the voting shares of the company, told Congress 
that it was “not at all clear” that social networks polarize people, 
when Facebook’s own researchers had repeatedly found that they 
do[34]! 

In an experiment in 2019, a pair of Facebook’s employees set up
a dummy account for a 21-year-old woman in India, the company’s
largest market. Without any input from this dummy account, the
feed to it was first filled with pornography and, soon after, it was
flooded with propaganda favourable to the then prime minister
and anti-minority hate speech. One reason could be the “reward”
changes made in this OSN’s algorithms. It appears that while the OSN
pushed into the developing world, it didn’t invest in protections
anywhere near the ones in the US context, themselves woefully
inadequate[63, 101].

While the program called ’Free Basics’ was initially attempted by
Facebook in India, opposition forced this OSN to abandon it[21].
However, it appears that Facebook did not give up this ploy and
has pushed the ’Free Basics’ program in other countries, e.g., Ghana,
Mexico and Myanmar, so that people there experience this OSN as
the internet tout court. As in its failed attempt in India, Facebook
partnered with local telecom operators in these countries to give free
access to its own platform along with a bundle of other basic services
like job listings and weather reports. This scheme has locked millions
of people into a version of the Internet controlled by this single
OSN[101].

In many OSN applications users are the messengers since they 
can forward messages to their friends and groups without checking 
the authenticity of the message[33]. By having users spread
messages on many of the OSNs, these companies are using free
labour; in addition, having the users invite their friends and family
to be part of the platform creates a positive feedback loop. One
may understand the small mom-and-pop store using the hosting
facility of the OSNs; however, one wonders at the judgment
of the managers of many large public institutions, such as universities,
which allow themselves to be part of these platforms by displaying
the logos of these OSNs on their home pages and using the OSNs’
’free’ hosting facility.

There are essentially two operating systems for most of the
computers in the world and just two operating systems for mobile
phones. Several of these OSs are derived from, or share ancestry with,
the very same open source operating system, Linux, which is, in turn,
modelled on Unix, an older operating system. However, after years and
years of development and many releases, they are full of bugs, loopholes
and trap doors. The case of the journalist who was killed in a consulate
in Turkey is well known; one of the factors which led up to this
was commercial spyware from a company in the Middle East,
which allowed this journalist and others to be spied on through
their mobile phones[54, 86]. Was the spyware possible due to some
vulnerability in the mobile operating system, and has it been
fixed? How easy it is to do something similar with these weak
systems is further illustrated by the recently reported case of a
15-year-old hacker who was able to break into the mobile system to
create spyware. This spyware was sold to tens of thousands
of domestic violence perpetrators[67].

Intrusion into the lives of users and constant surveillance, made
possible by the buggy nature of the mobile phone operating systems, is
carried out by corporations which terminate employees for giving their
opinion to the world. In the case of a coffee franchise, the application
tracked the user: most of the data was collected even when the
application was not being used. This is an intrusion and a loss of
users’ privacy. Hardly any user knowingly downloads and installs
such applications for the benefit of the business which offers the
application or of any third parties. In exchange for one’s privacy, the
meagre compensation offered by this chain of coffee purveyors is a
cup of coffee and a donut[7, 45, 100].

After last year’s whistle-blowing revelations relating to practices
at Facebook, the Uber files, published recently by The Guardian,
constitute another seminal big tech morality tale[36]. The vast cache
of documents, leaked by a former key public relations employee,
offers an insight into a digital giant as it sought to expand at any
cost. At the same time, it chronicles the complicity of a political
class which, itself drunk on big-tech concoctions, went along for
the ride.

The so-called ride-sharing service was introduced to provide,
in effect, a taxi service without having to own a fleet of taxis, get
permits for them from the local municipal authority, pay auto
insurance or hire drivers to drive them. The fact that the livelihoods
of tens of thousands of taxi drivers, and their licenses, which could
have cost thousands of dollars, would be devalued was completely
ignored not only by this company but also by the politicians who
championed this scheme. The scheme was to have individual drivers
with cars provide a taxi service using an application on a mobile
phone. The way this business was set up is outlined in thousands
of documents from the company released by a lobbyist who was
associated with the company: these are called the Uber files[35].
According to a Guardian editorial, this trove of documents “offers
a unique and salutary insight into the arrogance and hubris of a
digital giant”[81]. The approach used by the company, consistent
with those of the big techs, was to infringe on existing regulations,
show the politicians the benefits of the company for their
self-interest, and have them change the laws or make the laws not
applicable to it. In the case of Uber, these politicians did not
recognize that the local and national taxi services could have been
given an opportunity to devise such a software system, provide local
jobs to software developers and use idle computing capacity from local
data centres.

Uber also used a tool called Greyball to identify officials
posing as the company’s clients, using the data collected from its
own application and from others, in order to avoid being detected in
many cities and countries[50]. This tool was used to dominate this fake
taxi business. Uber also treated its drivers as contractors, not
providing any benefits, and showed them goals that they could achieve,
which pushed them to work longer. It gave large grants to
academics to produce research to feed to the media[40].
Looking at the Uber papers raises a question: did the politicians
take orders from the Uber executives[35]?

Some early enthusiasts of OSNs are finally waking up and uttering
’mea culpa’ as they realize the evil that is being done by the
meta-stasized internet today. The intimacy of the big-tech companies
with US leaders was one of the reasons that the mergers
were allowed to happen: "Obama’s regulators allowed Facebook to 
buy up its biggest competitors — first Instagram, then WhatsApp 
— and failed to crack down on its recklessness with users’ private 
data"[60]. These same leaders used the OSNs’ reach to get elected,
re-elected and collect funds. The fundraising is evidenced by the
amounts collected by unsavoury players who sacrifice principles,
reality and their oaths for self-service and imagined wrongs[45, 64].

The metastasis of the internet has allowed a handful of individuals,
exploiting the groundwork done by academicians and researchers,
to transform systems that were supposed to allow connectivity
in a way that is detrimental to all societies and to human
rights, including privacy and democracy.


7 NEED FOR A NEW BEGINNING 


Humans need a better web, better IoTs, better mobile devices, better
software, better protection against monopolies, and of course
better politicians and systems of government. Unfortunately, many
revolutions did not improve the lot of the ordinary person.

The internet, through web applications and mobile systems,
has revolutionized the world, with over 4 billion people using this
medium. They read news, send emails and text messages, hold
video conversations and find answers to questions. Yet when
these billions participate in online life, most of them rely heavily on
the services of just two corporations that control the operating systems
for mobile devices. The mobile devices used to access these
services are so integral that it is difficult for people to use the
internet without them.

One of the rays of hope is the set of steps taken by the European Union[76].
Proposed EU legislation would force internet services to combat
misinformation, publicize their roles in amplifying divisive content
and stop targeting ads based on ethnicity, religion or sexual
orientation. The law is an attempt to address the harm done by OSNs,
requiring them to be more pro-active in monitoring their platforms
for illicit content or risk billions of dollars in fines. Tech companies
would be compelled to set up new policies and procedures
to rapidly remove flagged hate speech, terrorist propaganda and
other material defined as illegal by countries within the European
Union. This law puts an end to self-regulation, under which such
monitoring was not done since growth was put above monitoring the
contents!

Laws such as the above need to be passed in other parts of
the world. However, what is more important is to take this time to
have the public authorities provide a communication system as a
necessary utility for all, to complement the postal system. Failures such
as the recent one in the Canadian communication system offered
by a private, for-profit organization should become a cautionary
tale[57].

Currently, there are just a limited number of for-profit US-based
corporations which offer web search, email, and mobile and other
operating systems. Software has moved from being ’sold’ to being
licensed, to provide a steady stream of income for these companies. The
existence of open source software is most likely being used as a
shield to protect these behemoths from being treated as monopolies.
The unfortunate thing is that many people, even IT professionals,
do not use open source software!

As pointed out in [1, 29], there is an urgent need to set up a
global Software Assurance Agency (SAA) to certify all software regardless
of its origin. The concept is similar to that of CSA[17] or UL[82]: all
software would have to go through certification by this agency.

Countries should align to put an end to having billions of mobile
devices managed by operating systems controlled by just two
companies. We have to address this software duopoly:
given that this is how people access large parts of contemporary
reality, this software must be liberalized and controlled by a public
agency. Also, open source software must be adopted in schools and
universities. The trend of using whichever one system seems to be in
vogue must end.

OSNs and their new algorithms should be required to pass some
kind of stress test before their actual deployment;
this should be monitored and certified by a global agency such as
the proposed SAA. Humans are sensitive to and respond to
emotional triggers, and they also share messages that reinforce their
beliefs; hence, algorithms must be checked to prevent them from creating
hatred and animosity[84].

Some governments have made feeble attempts, e.g., the Canadian
Bill C-18, to level the playing field by making the OSNs pay for
the news they use, which may be produced by struggling small
and medium news organizations. The charade that things are “free”
must end and a reasonable charge should be put on contents that
are really not free. Each unit of content should be allowed to carry a
competitive micro-price-tag. This would also allow the removal of all
paywalls from all news media sites. Since the end user is paying for the
use of the internet connection and the amount of data used, some portion
of the charges made by the ISPs should flow back to the original
producers of the news based on the micro-price-tag. Since most users
consume only a fraction of the news put out by any one publication
(see Figure 3), the micro-charges would be reasonable and
could easily be built into the fees charged by the ISP. By making
news accessible from the original responsible source, people will
spend more time following real news rather than be fed junk by
the OSNs.


Figure 3: Number of articles read 


A paywall system for news media does not allow users to read
the articles. In the case of some newspapers, the paywall keeps displaying
messages asking for support of the media and keeps a count of articles read.


It is the current practice that many large news organizations
offer their digital content to a user for a modest amount such as one
dollar a week. However, this is an exorbitant amount considering
that most users read several news outlets, while the
amount of material used from each such outlet is relatively small.
Hence, automatically transferring a portion of the monthly internet
connection charge to the source of the information would solve the
problem without a multitude of digital subscriptions.
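
The flow-back of a portion of the ISP charge can be made concrete with a
small sketch. It is only an illustration under stated assumptions, not a
scheme specified in this paper: the function name settle_month, the use of
integer micro-dollars (millionths of a dollar) for the micro-price-tags, and
the pro-rata cap applied when a subscriber's reading exceeds the earmarked
portion of the monthly fee are all hypothetical choices.

```python
from collections import defaultdict


def settle_month(reads, pool):
    """Split one subscriber's earmarked ISP fee among news sources.

    reads: iterable of (source, price_tag) pairs, one per item read;
           price_tag is the item's micro-price-tag in micro-dollars.
    pool:  portion of the monthly ISP fee earmarked for news, in
           micro-dollars (a hypothetical parameter).
    Returns a dict mapping each source to its payout in micro-dollars.
    """
    owed = defaultdict(int)
    for source, price_tag in reads:
        owed[source] += price_tag

    total = sum(owed.values())
    if total <= pool:
        # The earmarked amount covers everything that was read.
        return dict(owed)

    # Otherwise scale every payout down pro rata (integer floor division),
    # so the subscriber never pays more than the earmarked amount.
    return {source: amount * pool // total for source, amount in owed.items()}


# Hypothetical month: two items from one outlet, one from another.
month = [("Daily A", 20_000), ("Daily A", 20_000), ("Weekly B", 50_000)]
print(settle_month(month, pool=1_000_000))
print(settle_month(month, pool=45_000))
```

With these made-up figures the first call pays the two outlets 4 cents and
5 cents respectively; in the second call the earmarked amount is too small,
so both payouts are halved. Integer micro-dollars are used so that very
small price tags can be summed without floating-point rounding.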

The above scheme, in addition to being simple, would allow all
ads in the contents to be removed since the end user is paying for the
contents. Eliminating the ads would save bandwidth and energy;
the latter is good for the environment. This solution would mean
that there would be no need for paywalls and nagging requests
to sign up or to be put on a mailing list for headlines, etc. To
restore the web to its original spirit of sharing, web browsers
should never allow trackers and third-party cookies.

Consider the fact that the big-techs are using the ideas and
even the software algorithms released with previous-generation
systems, and exploiting these openly accessible concepts to create
systems that exploit the human race and amass a fortune
while the needs of millions of humans, including children, are not
met. One may look at one of the ideas of property put forward by Locke,
who considered a person’s work as his property as long as enough is left
for the common good of others[56].

“Sec. 27. Though the earth, and all inferior creatures, be common 
to all men, yet every man has a property in his own person: this 
nobody has any right to but himself. The labour of his body, and the 
work of his hands, we may say, are properly his. Whatsoever then 
he removes out of the state that nature hath provided, and left it in, 
he hath mixed his labour with, and joined to it something that is his 
own, and thereby makes it his property. It being by him removed 
from the common state nature hath placed it in, it hath by this 
labour something annexed to it, that excludes the common right of 
other men: for this labour being the unquestionable property of the 
labourer, no man but he can have a right to what that is once joined
to, at least where there is enough, and as good, left in common for 
others.” 


Figure 4: Copy Forward 


The concept of CopyForward was put forward by the author
so as not to let baron-entrepreneurs exploit human knowledge for private
gain. It depends on moral obligation, with the hope that it will help
the coming generations.


There are various forms of ’protecting’ a person’s work, which
is their property: the usual one is copyright. In the digital age, it is
becoming difficult to enforce this! The author has come up with
CopyForward, which treats digital content as the
property of the person creating it but also allows it to be shared on
terms determined by the creator, as given below:

“The document/work, in digital/electronic form, could be used 
for personal use and/or study, free of charge. Anyone could use it 
to derive updated versions. The derived version must be published 
under CopyForward. All authors of the version used to derive the 
new version must be included in the updated version in the existing 
order, followed by name(s) of author producing the derived work. 
Such derived version must be made available free of charge in 
electronic/digital form under CopyForward. Any other means of 
reproduction requires that part of the profit (income minus the
actual production cost), not less than a third (33.33 percent), should
be shared with established charitable organizations for children. 
Persons who found this document/work or any derived work useful 
are encouraged to also make a donation to the author(s) and/or 
their favourite charity. 

Make sure to choose a charity which has very modest administrative
charges (NOT more than 20% of their entire annual budget)
or some deserving children in your own community.” 

One of the first CopyForward items can be accessed from the 
Spectrum library[31]. 

One wonders what would have happened if the people who released the
software and the concepts being used by the big-techs had used CopyForward!

In the meantime, the author hopes that the IT community will
set up a project to realize the proposal[30] to allow an ordinary
person to set up her own email and web server. The ownership of
data could then be reclaimed and there would be little need for these
big-techs.


Figure 5: Heimdallr


This is a system diagram to illustrate a turn-key system that
could be marketed as a replacement for a modem-router: it has a built-in
email and web server to allow anyone to reclaim custody of her
personal data.


REFERENCES 


[1] Aksoy, Ayberk; Desai, Bipin C.: Heimdallr_1: A system design for
the next generation of IoTs, ICNSER2019, March 2019, pp 92-100
https://doi.org/10.1145/3333581.3333590

[2] AAI: America in the Age of Information, July 6-7, 1995, Lister Hill Center,
Bethesda MD,
http://users.encs.concordia.ca/~bcdesai/Age-of-Information-July-1995.pdf

[3] Anders, Günther: The obsolescence of man - Volume 2,
https://files.libcom.org/files/ObsolescenceofManVol II Gunther Anders.pdf

[4] Appathurai, James: Tech is enabling autocrats. Here’s how to fight back, The Globe
and Mail, 24 March, 2022
https://www.theglobeandmail.com/opinion/article-tech-is-enabling-autocrats-heres-how-to-fight-back/

[5] Amnesty International: Surveillance Giants: How The Business Model Of Google
And Facebook Threatens Human Rights, 21 Nov, 2019, Index Number POL
30/1404/2019 https://www.amnesty.org/en/documents/pol30/1404/2019/en/

[6] Albertus, Michael; Menaldo, Victor: Aftermath of Revolution, NY Times, Feb.
14, 2013
https://www.nytimes.com/2013/02/15/opinion/global/aftermath-of-revolution.html

[7] Al Mallees, Nojoud: App’s data tracking resulted in loss of users’ privacy,
says report by federal, provincial authorities, CBC News, Jun 01, 2022
https://www.cbc.ca/news/business/tim-hortons-app-report-1.6473584

[8] Albergotti, Reed: FACEBOOK UNDER FIRE, Washington Post, 26 Oct. 2021
https://www.washingtonpost.com/technology/2021/10/26/frances-haugen-facebook-whistleblower-documents/

[9] Bogost, Ian: Google’s Sentient Chatbot Is Our Self-Deceiving Future, The Atlantic,
June 2022
https://www.theatlantic.com/technology/archive/2022/06/google-engineer-sentient-ai-chatbot/661273/

[10] Brittain, Blake: Meta hit with trademark lawsuit over new infinity-symbol logo,
Reuters, May 2, 2022
https://www.reuters.com/legal/litigation/meta-hit-with-trademark-lawsuit-over-new-infinity-symbol-logo-2022-05-02/


IDEAS2022 


(11) 


[15] 


[16] 


(17] 


24] 


[25] 


26] 


32] 


33] 


(34] 


(35] 


Bipin C. Desai 
Behr, Rafael The Uber files tell a simple truth: democracy 
depends on curbing mercenary tech giants, The Guardian, 


https://www.theguardian.com/commentisfree/2022/jul/11/uber-files- 
democracy-silicon-valley 


] Butler, Samule Erewhon or Over the Range Original 9 June, 1872. Prjoject 


Gutenberg March 20, 2005 https://www.gutenberg.org/files/1906/1906-h/1906- 
h.htm 
Charles Babbage, Wikipedia https://en.wikipedia.org/wiki/Charles_Babbage 


] CBC, The Canadian Press B.C. court allows class-action lawsuit 


against Facebook to expand The Canadian Press, 14 May, 2019 
https://www.cbe.ca/news/canada/british-columbia/facebook-class-action- 
expansion-1.5135031 

Chen, Brian X.: I Downloaded the Information That 
Facebook Has on Me. Yikes, NY Times, Apr. 11, 2018, 
https://www.nytimes.com/2018/04/11/technology/personaltech/i-downloade 
d-the-information-that-facebook-has-on-me-yikes.html 

Charette, Robert N.: Canadian Government’s Phoenix Pay System 
an “Incomprehensible Failure”: That’s the nicest thing that could be 
said for a debacle of the first rank, IEEE Spectrum, 05 Jun 2018 
https://spectrum.ieee.org/riskfactor/computing/software/ canadian- 
governments-phoenix-pay-system-an-incomprehensible-failure 

CSA Group https://www.csagroup.org/ 


] Cann, Vicky Uber’s privileged access to politicians shows the lobby 


system urgently needs to change, The Guardain, Mon 11 Jul 2022 
https://www.theguardian.com/commentisfree/2022/jul/11/uber-privileged- 
access-eu-politicians-lobby-system-change 


] Desai, Bipin C. Technological Singularities, IDEAS 15, July 13 - 15 2015, Yokohama, 


Japan http://dx.doi.org/10.1145/2790755.2790769 


] Desai, Bipin C. The Web of Betrayals, IDEAS 2018, June 18-20, 2018, Villa San 


Giovanni, Italy https://doi.org/10.1145/3216122.3216140 


] Desai, Bipin C.: Colonization of the Internet, IDEAS 2021, July 2021, pp 36-45 


https://doi.org/10.1145/3472163.3472179 


] Desai, Bipin C.: newblock Report of the Navigation Issues Workshop, Computer 


Networks and ISDN Systems, Vol. 27-2, November 1994, pp. 332-333. 


] Desai, Bipin C.; Swiercz, Stan: WebJournal: Visualization of a Web Jour- 


ney, In: Digital libraries: research and technology advances: selected pa- 
pers: ADL’95 Forum, McLean, Virginia, USA, May 15-17, 1995. Lecture notes 
in computer science (1082). Springer, Berlin, pp. 63-80. ISBN 9783540614104, 
https://spectrum.library.concordia.ca/983869/ 

Desai, Bipin C.: Test: Internet Indexing Systems vs List of Known URLs, June, 
1995, available on the Web from https://users.encs.concordia.ca/~bcdesai/test- 
of-index-systems.html, https://spectrum.library.concordia.ca/983875/ 

Desai, Bipin C:: Test: Internet Indexing Systems vs List of Known 
URLs: Re visited, October 1997, available on the Web from 
https://users.encs.concordia.ca/~bcdesai/test-of-index-systems-revisited.html, 
https://spectrum.library.concordia.ca/983876/ 

Desai, Bipin C.; Pinkerton, Brian (ed): Proceedings of the WWW III Workshop on 
Web-wide Indexing/Semantic Header or Cover Page, Darmstadt, Germany, April 
1995, https://users.encs.concordia.ca/~bcdesai/www3-wrkA/www3-wrkA- 
Proc.ps.gz 


] Desai, Bipin C.: Search and Discovery on the Web, October 2001, 


https://spectrum.library.concordia.ca/983874/ 


] Desai, Bipin C.: The state of data, IDEAS2014, Portugal July 2014, 77-86, ISBN: 


978-1-4503-2627-8 DOI 10.1145/2628194.2628229 


] Desai, Bipin C.: JoT-Imminent Ownership Treat,, Proc. IDEAS 2017, Bristol, July 


2017, pp 82-89, DOI 10.1145/3105831.3105843 


] Desai, Bipin C.: Privacy in the Age Of Information (and algorithms) IDEAS 2019, 


June 2019, Athens, Greece https://doi.org/10.475/3331076.3331089 


] Desai, Bipin C.;Kipling, Arlin L Database Web Program- 


ming, BytePress, 2020, ISBN 9781988392066 9781988392042 
https://spectrum.library.concordia.ca/id/eprint/988529/2/WebDB-Desai- 
Kipling-Oct-2020.pdf 

Desai, Bipin C.: An Introduction to Database  Sys- 
tems, West, St. Paul, MN. 1990, ISBN  0-314-66771-7, 
https://spectrum.library.concordia.ca/id/eprint/988586/1/An-Introduction-to- 
Database-Systems-Bipin-C.DESAI pdf 

Dwoskin, Elizabeth; Owen, Annie G : On WhatsApp, fake news 
is fast — and can be fatal, Washington Post, 23 July, 2018 
https://www.washingtonpost.com/business/economy/on-whatsapp-fake- 
news-is-fast—and-can-be-fatal/2018/07/23/a2dd7112-8ebf-11e8-bed5- 
9d911c¢784c38_story.html 

Dwoskin, Elizabeth; Newmyer, Tory; Mahtani, Shibani: 

The case against Mark Zuckerberg: Insiders say Facebook’s CEO 
chose growth over safety Wahington Post, 25 October, 2021 
https://www.washingtonpost.com/technology/2021/10/25/mark-zuckerberg- 
facebook-whistleblower/ 

Davies, Harry; Goodley, Simon; Lawrence, Felicity; Lewis, Paul; O’Carroll, Lisa: 
Uber broke laws, duped police and secretly lobbied governments, leak reveals, The 


52 


Meta-stasis of the Internet 


36 


37 


38 


39 


50 


51 


52 
53 


54 


55 


56 


57 


58 


Guardian, July 11. 2022 https://www.theguardian.com/news/2022/jul/10/uber- 
files-leak-reveals-global-lobbying-campaign 

Davies, Harry; Goodley, Simon; Lawrence, Felicity; 
O’Carroll, Lisa: The Uber files The Guardian, 
https://www.theguardian.com/news/series/uber-files 

Dani, John A.; Balfour, David J.K. Historical and current perspective on tobacco 
use and nicotine addiction Review Historical Perspective], 14-7, P383-392, July 
01, 2011 https://doi.org/10.1016/j.tins.2011.05.001 

Davies, Rob; Goodley, Simon Uber bosses told staff to use ‘kill switch’ 
during raids to stop police seeing data The Guardain, 10 July, 
2022 https://www.theguardian.com/news/2022/jul/10/uber-bosses-told-staff- 
use-kill-switch-raids-stop-police-seeing-data 

Favreau, Francois-Alexis Sonneur d’alerte chez Uber: le lobbyiste 
Mark MacGann se dévoile, SRC , 11 July 2022 https://ici.radio- 
canada.ca/nouvelle/1897240/lanceur-alerte-uber-mark-macgann-systeme- 
mensonge? 

Lawrence, Felicity Uber paid academics six-figure sums for re- 
search to feed to the media The Guardian, 12 July, 2022 
https://www.theguardian.com/news/2022/jul/12/uber-paid-academics- 
six-figure-sums-for-research-to-feed-to-the-media 

Guinan, Joe; O’Neill, Martin: Only bold state intervention will save us 
from a future owned by corporate giants, The Guardian, 6 Jul 2020, 
https://www.theguardian.com/commentisfree/2020/jul/06/state-intervention- 
amazon-recovery-covid-19 

Greenberg, Martin The Computers of Tomorrow, The Atlantic, 
1964 https://www.theatlantic.com/magazine/archive/1964/05/the-computers-of- 
tomorrow/658239/ 

Gibbs, Samuel: Apple fixes HomeKit bug that allowed re- 
mote unlocking of users’ doors , The Guardian 8 Dec. 2017 
https://www.theguardian.com/technology/2017/dec/08/apple-fixes-homekit- 
bug-remote-unlocking-doors-security-flaw-iphone-ipad-ios-112-smart-lock- 
home 

Greenhouse, Steven: Amazon chews through the average worker 
in eight months. They need a union, The Guardian, Feb 4, 2022, 
https://www.theguardian.com/commentisfree/2022/feb/04/amazon-chews- 
through-the-average-worker-in-eight-months-they-need-a-union 
Goldmacher, Shane; Haberman, Maggie: Trump Raises $170 Million 
as He Denies His Loss and Eyes the Future The NY Times, Aug. 


Lewis, Paul; 
11 Jul 2022 


7, 2021  https://www.nytimes.com/2020/11/30/us/politics/trump-campaign- 
donations.html 

Guardain Staff Google fires software engineer who claims 
AI chatbot is sentient, The Guardain staff, 23 Jul 2022 


https://www.theguardian.com/technology/2022/jul/23/google-fires-software- 
engineer-who-claims-ai-chatbot-is-sentient 

Hern, Alex TechScape: suspicious of TikTok? You’re not alone, The Guardian, 0 
Jul 2022 https://www.theguardian.com/technology/2022/jul/20/tiktoks-privacy- 
problem-isnt-what-you-think 

Hamby, Chris; Forsythe, Michael: Behind the Scenes, McKinsey Guided 
Companies at the Center of the Opioid Crisis New York Times, June 29, 
2022 https://www.nytimes.com/2022/06/29/business/mckinsey-opioid-crisis- 
opana.html 

Hager, Mike; O’Kanet, Josh: Business groups, Tories seek changes to 
Canadian tax system after Amazon findings GLobe and Mail, July 18, 
2022 _ https://www.theglobeandmail.com/business/article-industry-groups- 
opposition-call-for-changes-to-canadian-tax-system/ 

Isaac, Mike How Uber Deceives the Authorities Worldwide New York Times, 3 
March, 2017 https://www.nytimes.com/2017/03/03/technology/uber-greyball- 
program-evade-authorities.html 

Keefe, Patrick Radden: Empire of Pain 
13:9780385545686 

Knox, Ron The copyright kille 11 January 2019 

Kermani, Secunder: Pakistan activists targeted in Facebook attacks, BBC, May 
15, 2018, http://www.bbc.com/news/world-asia-44107381 

Kirchgaessner, Stephanie Saudis behind NSO spyware attack on Ja- 
mal Khashoggi’s family, leak suggests The Guardain, 18 Jul 2021 
https://www.theguardian.com/world/2021/jul/18/nso-spy ware-used-to- 
target-family-of-jamal-khashoggi-leaked-data-shows-saudis-pegasus 

Lima, Cristiano A whistleblower’s power: Key takeaways 
from the Facebook Papers Washington Post, 26October, 2021 
https://www.washingtonpost.com/technology/2021/10/25/what-are-the- 
facebook-papers/ 

Locke, John Second Treatise of Government 
https://www.gutenberg.org/ebooks/7370 

Logan, Nick- Rogers network outage a ‘wake-up call’ 
business, essential services disrupted CBC News 8 July, 
https://www.cbc.ca/news/business/rogers-outage-no-plan-b-1.6515664 
McCabe, David: Congress and Trump Agreed They Want a Na- 
tional Privacy Law. It Is Nowhere in Sight. NYTimes Oct. 1, 2019 


Doubleday Books, 2021 ISBN 


1688, Gutenberg 2005 


after 
2022. 


IDEAS2022 


59 


61] 


62] 


68] 
69] 




Acknowledgement 


The author would like to acknowledge the valuable discussions with 
members of the family and the contribution of many philosophers, 
researchers and investigative journalists cited and perhaps missed; 
these have been valuable in preparing this article. 




Provenance in Spatial Queries 


Paulo Pintor 
IEETA - University of Aveiro 
Aveiro, Portugal 
paulopintor@ua.pt 


ABSTRACT 


Data growth has been a known problem for several years, yet more
and more people, tools and devices create and share data, so tools
to infer the provenance and quality of data are more important
than ever. Research on data provenance
focuses on W3C PROV and databases (where, why, how). However, 
in the particular case of spatial data, research has mainly focused on 
handling spatial data provenance from documents and workflows, 
but there is no literature approaching the topic of spatial data 
provenance in DBMS and queries. 

This paper deals with the computation of How-, Why- and
Where-provenance in spatial database queries. It presents an evaluation
of how the formalism and methods proposed to deal with
general-purpose database queries behave when dealing with spatial 
data. Two tools are used to manage provenance in databases and 
a discussion of the results and guidelines for future work are pre- 
sented. This is a first contribution towards dealing with spatial data 
provenance by tuple, attribute and query, whereas previous work 
has only focused on the management of provenance at a coarser 
level, namely documents and workflows. 


CCS CONCEPTS


• Theory of computation → Data provenance; • Information
systems → Query languages; Spatial-temporal systems.


1 INTRODUCTION


Technological evolution is constantly increasing the capacity to 
obtain and process data from a variety of data sources, such as satel- 
lite and aerial images and GPS data captured using mobile devices, 
among many others. Often, these data have a spatial component 
that is important to consider. Since these data may originate from 
many sources, several issues arise regarding data quality, reliability,
and trustworthiness, which leads us to data provenance, that is,
how to explain the origin of the data and the transformations applied
to them over time.

Besides helping to ensure these features, data provenance may 
also aid in data debugging by showing how and why a result is ob- 
tained, and thus help to catch errors. The information also helps re- 
producibility or replication, for instance, when we want to perform 
new tests with a dataset used in the past and we need to recreate the 
environment [17]. Since it collects data about the transformation 
and how the result has been obtained, it will also contribute to 
understandability [17]. 

W3C PROV [8] is a standard for representing and exchanging
data about the agents and the processes involved in creating a
data instance, using concepts such as Agent, Activity, and Entity [7].
There are also works on representing data provenance in databases 
with finer granularity, namely, at tuple and attribute level [4, 6, 30] 
rather than at entity or table level. The main types of provenance 
in this context are how-, where- and why-provenance, and each one 
aims to explain the query results from different perspectives. 

For spatial data, there are also several works on maintaining 
information about the origin and transformation of spatial datasets, 
including the algorithms and computational systems used, from 
their acquisition to their storage in databases or other types of spa- 
tial data infrastructures [20]. However, to the best of our knowledge, 
there is no work evaluating whether the theories and formalism 
proposed in the literature to compute the provenance of the results 
of database queries can also be applied to spatial data. 

This paper presents a study on how to compute the provenance 
of the data in the result of a spatial database query. Starting from 
a classification of spatial data and operations [26] that allows us 
to list the types and the cases that we must consider when dealing 
with spatial queries, we present a systematic evaluation of how
to compute how-, where- and why-provenance and of what each of them
contains. This evaluation uses ProvSQL, a tool for provenance and
probability management in PostgreSQL [30], and a tool for provenance
computation in distributed databases [25].

The rest of this paper is organized as follows. Section 2 presents 
the basics of data provenance and the formalism used in databases, 
what spatial data are and what kind of spatial operations exist, 
and an overview of the spatial data provenance literature. It also 
highlights that there are research gaps in this area. Next, section 3 
shows how to compute the three types of provenance considered 
in this work for each type of spatial operation and what particular- 
ities emerge in this context. Section 4 presents spatial queries and 
provenance information of the results obtained using the two tools 
considered in this work. Section 5 briefly discusses the results
and describes guidelines for future work.


2 BACKGROUND AND RELATED WORK


2.1 Data provenance 


The concern about describing data is not new. In the 1990s, research
on this subject was already being carried out under the name of
lineage [5]. Later, in the 2000s, many papers on data provenance
were published in fields such as data warehousing, the semantic web,
and scientific workflows, among others [31]. Recently, the topic has
received more emphasis due to data science, the problem of data
growth, and the need to extract knowledge from raw data and ascertain
data quality.

In [17, 18], provenance is divided into four levels, namely,
provenance meta-data, information system provenance, workflow
provenance, and data provenance. Provenance meta-data requires the
least instrumentation and is the most general in terms of process
and provenance models, because it can be anything that describes an
object. At the other end, data provenance is the most specific in
terms of process and model and requires the most instrumentation,
since it collects provenance from databases and depends on the
semantics, the query languages, and the data models.

The World Wide Web Consortium (W3C) has proposed PROV, a standard
model and ontology for describing provenance, to help describe
workflow provenance [8]. This standard can be seen as an evolution
of the Lineage standards ISO 19115 and ISO 19115-2. W3C PROV
contains three different elements: Entity, Activity, and Agent.


• An Entity is what we can call a thing; it can be real or
imaginary, e.g., something digital;

• An Activity occurs over a given period of time. It can use,
modify, process or generate new entities;

• An Agent can be anything from a person to software. It can
bear some responsibility for the existence of an entity or of
another agent, and some type of responsibility for activities.
This responsibility further means that an agent can be a
particular type of entity or activity. Hence, we can reveal the
provenance of an agent.
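
As a minimal sketch of how the three PROV elements fit together, the Python below models entities, activities and agents as plain data classes and walks the "used" and "generated" links to recover what an entity depends on. The class layout, the helper `derived_from` and the example names (`gps_trace`, `road_layer`, and so on) are illustrative assumptions, not the PROV serialisation or any PROV library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    id: str


@dataclass(frozen=True)
class Agent:
    id: str


@dataclass
class Activity:
    id: str
    used: list = field(default_factory=list)             # input entities
    generated: list = field(default_factory=list)        # output entities
    associated_with: list = field(default_factory=list)  # responsible agents


def derived_from(entity, activities):
    """Return every entity that `entity` transitively depends on."""
    found, frontier = set(), [entity]
    while frontier:
        current = frontier.pop()
        for activity in activities:
            if current in activity.generated:
                for source in activity.used:
                    if source not in found:
                        found.add(source)
                        frontier.append(source)
    return found


# Hypothetical two-step workflow: clean a GPS trace, then build a map layer.
raw = Entity("gps_trace")
cleaned = Entity("cleaned_trace")
layer = Entity("road_layer")
script = Agent("cleaning_script")
activities = [
    Activity("clean", used=[raw], generated=[cleaned], associated_with=[script]),
    Activity("build_layer", used=[cleaned], generated=[layer]),
]
```

Calling `derived_from(layer, activities)` returns both the cleaned and the raw trace, which is the coarse, workflow-level view of provenance that the rest of this section contrasts with tuple-level provenance.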


Although agreeing with the W3C proposed model, some authors 
mention that in a few database contexts, the model brings some 
unnecessary formalisms [5]. In contrast, it is also possible to under- 
stand the model’s potential when applied in complex areas such as 
health and geospatial data [8, 21]. 

The main types of data provenance in the literature are why-, 
how- and where-provenance [5, 6, 14]. 

Why-provenance gives the tuples that contributed to a query
result [4-6, 14]. The tuples are seen as witnesses of the result, and
the technique to obtain this kind of provenance is called witness
basis. Formally, for a query $Q$ over a database $I$ and a tuple $t$ in
$Q(I)$, an instance $I' \subseteq I$ is a witness for $t$ if
$t \in Q(I')$, so that
$Why(Q, I, t) = \{ I' \subseteq I \mid t \in Q(I') \}$ [4, 6]. The
result is a set of witnesses covering all the possible combinations
of tuples, and it does not contain duplicates.
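
To make the definition concrete, the following Python sketch enumerates every sub-instance of a tiny database and keeps those in which the tuple still appears in the query result. The toy relation, the query `Q` and the helpers `witnesses` and `minimal` are assumptions made for illustration; the enumeration is exponential and is not how the tools discussed later compute provenance.

```python
from itertools import combinations

# Hypothetical toy instance: each tuple is (token, name, city).
I = [("t1", "Ana", "Aveiro"), ("t2", "Rui", "Aveiro"), ("t3", "Eva", "Porto")]


def Q(instance):
    """Toy query: SELECT DISTINCT city FROM instance."""
    return {(city,) for (_token, _name, city) in instance}


def witnesses(query, instance, t):
    """Return every sub-instance I' of `instance` with t in query(I'),
    each represented by the frozenset of its tuple tokens."""
    result = set()
    for size in range(len(instance) + 1):
        for subset in combinations(instance, size):
            if t in query(subset):
                result.add(frozenset(row[0] for row in subset))
    return result


def minimal(witness_set):
    """Keep only the witnesses that do not strictly contain another witness."""
    return {w for w in witness_set if not any(v < w for v in witness_set)}


why_aveiro = witnesses(Q, I, ("Aveiro",))
```

Here `minimal(why_aveiro)` contains the two single-tuple witnesses for the result tuple, one per source tuple located in Aveiro; being a set, it holds no duplicates.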

Whilst why-provenance shows the tuples that are involved in
a query result, how-provenance explains how these tuples contributed
to the result. How-provenance resorts to polynomial expressions
over algebraic structures called semirings to obtain the result, and
this technique assumes that each tuple has an identifier called
provenance token [4-6, 13, 14, 28].

The general definition of a semiring is $(K, 0, 1, \oplus, \otimes)$, and
different choices of $K$ can be used to obtain different answers. For
how-provenance, the universal semiring or how-semiring is
$(\mathbb{N}[X], 0, 1, \oplus, \otimes)$, where $\mathbb{N}[X]$ is the
set of polynomials with natural-number coefficients over the
provenance tokens $X$ that annotate the tuples. The constant 0 means
that the tuple is not in the query result $Q$ and 1 means that the
tuple is present in the query result. The binary operators show how
tuples relate to each other, such that $\oplus$ denotes alternative
use and $\otimes$ joint use. The semiring obeys commutative,
associative, and distributive laws, which apply depending on the
query’s operation [13]. Unions are associative and commutative
operations and are represented by $\oplus$. Joins also have those two
properties, but they are also distributive over unions and they are
represented by $\otimes$. Projections and selections are also
commutative among themselves.
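
The semiring laws above can be checked on a small sketch. The Python below represents a provenance polynomial as a `Counter` from monomials (sorted tuples of tokens) to natural-number coefficients; the representation and the names `token`, `plus` and `times` are our own assumptions for illustration, not the data structures of any tool cited in this paper.

```python
from collections import Counter

# A polynomial in N[X]: monomial (sorted tuple of tokens) -> coefficient.
ZERO = Counter()          # annotation of a tuple absent from the result
ONE = Counter({(): 1})    # neutral element of the product


def token(name):
    """Polynomial consisting of a single provenance token."""
    return Counter({(name,): 1})


def plus(p, q):
    """Alternative use (e.g. union): add the coefficients."""
    return p + q


def times(p, q):
    """Joint use (e.g. join): multiply every pair of monomials."""
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out


t1, t2, t3 = token("t1"), token("t2"), token("t3")
joined = times(t1, plus(t2, t3))   # t1 joined with the union of t2 and t3
```

The value of `joined` equals `plus(times(t1, t2), times(t1, t3))`, which is the distributivity of joins over unions stated in the text.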

From the above, why-provenance gives the tuples (witnesses)
involved in a result and how-provenance explains how the tuples
combine to produce the result. The last one, where-provenance, is
concerned with the origin of the individual values (instead of tuples)
in the query result. Thus, unlike the previous ones, it is not possible
to capture this kind of provenance using semirings. To deal with
this issue, [28] proposes to add annotations without algebraic terms
in order to create a bipartite graph that shows how the values are
connected and thus specifies the where-provenance result.
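
The attribute-level nature of where-provenance can be sketched as follows. Each cell carries the set of source locations (table, token, column) it was copied from, and a projection copies those annotations, while computed columns start with an empty set. This is a simplified illustration under our own assumptions (the helpers `annotate` and `project` are hypothetical), not the bipartite-graph construction of [28].

```python
# Hypothetical source table "A": token -> row.
A = {
    "t1": {"name": "A", "geom": "POLYGON((2.5 2.5,2.5 4.5,4.5 4.5,4.5 2.5,2.5 2.5))"},
    "t2": {"name": "B", "geom": "POLYGON((4 4,4 6,6 6,6 4,4 4))"},
}


def annotate(table_name, table):
    """Attach to every cell the location (table, token, column) it came from."""
    return [
        {col: (val, {(table_name, tok, col)}) for col, val in row.items()}
        for tok, row in table.items()
    ]


def project(rows, copied, computed):
    """Copy the columns in `copied` together with their annotations; columns
    in `computed` (name -> function of the row) get an empty annotation."""
    out = []
    for row in rows:
        new_row = {col: row[col] for col in copied}
        for name, fn in computed.items():
            new_row[name] = (fn(row), set())
        out.append(new_row)
    return out


rows = project(
    annotate("A", A),
    copied=["name"],
    computed={"geom_len": lambda row: len(row["geom"][0])},
)
```

In `rows`, the "name" cell of the first row is annotated with `("A", "t1", "name")`, whereas the computed "geom_len" cell has no source location.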


2.2 Spatial data 


In current systems, data can come from a variety of sources, which
can be more or less reliable, and data cleaning and transformation
processes, including data integration and wrangling, also have an
influence on data accuracy and quality [1]. These issues cut across
all data types. However, they are particularly important when
dealing with spatial data.

Spatial data are used to represent geographical objects, such
as roads, properties, and lakes. In the context of geographic
information systems, spatial data are represented using coordinates
(latitude and longitude). The three main types of spatial data are
point, line, and polygon (often also referred to as region) [15, 16, 27]:

• A point represents a specific location, like a traffic light in a
street;

• A line can be curved and can represent roads, for example;

• A region is delimited by external boundaries and may have
internal boundaries to represent holes. It can define objects
with an extent, like a lake.


Moreover, spatial objects can also be represented by more com- 
plex geometries, e.g., multi-points or multi-polygons, and even ge- 
ometries combining points, lines, and polygons. There is a standard 
data model called SQL/MM Spatial [32] to represent spatial data in 
Database Management Systems (DBMS). This standard was known 
initially as the OpenGIS Simple Features Specification for SQL and 
defines the spatial data formats, including the well-known text rep- 
resentation (WKT), well-known binary representation (WKB), and 
geography markup language (GML), as well as the Spatial Reference 
Systems. SQL/MM also defines the operations and functions to deal 
with spatial data, namely, to convert between spatial data formats, 
retrieve spatial properties, and find the interaction between spatial 
objects. 
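
As a small illustration of the well-known text (WKT) representation mentioned above, the Python sketch below parses the three basic geometry types into coordinate lists. The function `parse_wkt` is a hypothetical helper written for this example; it deliberately ignores holes, multi-geometries and spatial reference systems, which a real SQL/MM implementation handles.

```python
import re


def parse_wkt(text):
    """Parse POINT, LINESTRING and single-ring POLYGON WKT into
    (type, [(x, y), ...]). Holes and multi-geometries are not handled."""
    match = re.fullmatch(
        r"\s*(POINT|LINESTRING|POLYGON)\s*\(+([^()]*)\)+\s*", text
    )
    if match is None:
        raise ValueError(f"unsupported WKT: {text!r}")
    kind, body = match.groups()
    coords = [tuple(float(v) for v in pair.split()) for pair in body.split(",")]
    return kind, coords


square = parse_wkt("POLYGON((4 4,4 6,6 6,6 4,4 4))")
point = parse_wkt("POINT(5.5 4.5)")
line = parse_wkt("LINESTRING(3 7,5 5)")
```

A polygon ring repeats its first vertex at the end, which is why the parsed square has five coordinate pairs.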

In this work, we will use the classification of spatial operations
proposed in [26]:


• Regarding the number of parameters, spatial operations
can be unary, binary or N-ary. Unary operations accept only one
parameter, e.g., to return the area of a polygon or the length
of a line, to transform an object (rotation, translation,
skewing, and scaling) or to simplify an object. Binary operations
accept two parameters and can be further classified into
topological (e.g., adjacent, contains, inside, equals or
overlaps), distance or direction (e.g., North, South, Northwest,
etc.). These operations are often called spatial joins. Set
(N-ary) operations can operate on multiple spatial objects,
e.g., the union of several polygons.

• Regarding the result, spatial operations can return a Boolean,
a scalar value, or new spatial objects. Operations that return a
Boolean are called predicates and can be unary, e.g., isValid
to determine if the geometry of a spatial object is valid, or
binary, to check if there is a topological or directional
relationship between two spatial objects. Other operations
return a scalar, such as the area, perimeter, or distance
operations. Finally, there are operations that return new
spatial objects and can be unary, e.g., to return the convex
hull of a spatial object, binary, e.g., to obtain the difference
between two spatial objects, or operate on several spatial
objects, e.g., to compute the union or intersection of spatial
objects.


2.3 Related Work


Spatial data may have distributed sources and the data sets can be
unstructured (e.g., a file). Therefore, it is important to ensure the
datasets’ quality and trustworthiness, bearing in mind problems
such as the completeness of the data sets or their decay over the
years [23]. The latter problem is widespread for this type of data,
since it can represent maps. For example, roads are constantly
changing in a city (road works, changes of direction, etc.), so a
dataset that dates back a few years may not correspond to the
current state of a road.

Consequently, provenance is essential to ensure the quality of
the data and its trustworthiness. In the literature on spatial data
provenance, the focus is on workflows. Most recent works use
W3C PROV [7, 8], while early works in this field started with the
Lineage standards of ISO 19115 and ISO 19115-2 [9, 10]. A general
transition is being made from the ISO standards to W3C PROV, and
[19, 20] show how to proceed with this transition.

Spatial workflows are used to store and show the information
about all the steps or processes applied to the data. This includes
the data’s origin and the operations or algorithms that the data have
been through until the final dataset. A good example of this
process is a map composed of multiple layers. Resorting to the
W3C elements (agents, entities, and activities), it is possible to
store the multiple origins of the data, to record whether some
algorithm changed the data (e.g., a correction of coordinates), and
to describe the procedures that create the layers formed by
attributes or characteristics and the final dataset representing the
map.

There are also works that try to use provenance with unstructured
spatial data from text [22]. In this case, the authors used data from
social networks (location attributes, to be more precise) and studied
the possibility of deriving the provenance of that unstructured
information. The authors claim to have 80% accuracy in identifying
the location provider.

None of the previous works deals with the provenance of the
results of spatial database queries. However, it is important to
understand whether the characteristics and specifications of spatial
data can be supported by the existing approaches in different
scenarios.

In [24], the authors conducted a survey of several solutions for
provenance in different areas, including data provenance. However,
to deal with spatial data, these solutions need at least to work with
a DBMS that supports that kind of data. Therefore, combining
the research presented in that paper with the objectives of our work,
we chose three solutions to show how different the approaches can
be. The three solutions are ProvSQL [30], Perm [12] and GProM [3].
These three solutions work with PostgreSQL, a DBMS that supports
spatial data, but they all approach the data provenance problem
from different perspectives.

Perm is an extension to PostgreSQL, and its approach is based
on query rewriting. The authors mention that it supports regular
queries, although it is not prepared to support correlated subqueries.
Perm supports why- and where-provenance.

GProM is the only one of these three solutions that works with
more than one DBMS. It works with Oracle, SQLite, and
PostgreSQL, and its approach is to use a middleware. This platform
is intended to be used not only for provenance management but also
for annotation management. In terms of data provenance, GProM
captures why- and why-not provenance, and the authors claim
it can also deal with spatio-temporal information.

ProvSQL is a lightweight extension for PostgreSQL that sup- 
ports provenance computation and probabilistic query evaluation. 
ProvSQL uses semiring theory to compute how-provenance and 
proposes an extension to semirings called m-semirings to support 
negation. It also supports the capture of where-provenance. This 
work, like the two mentioned above, does not show how to deal 
with data provenance and spatial queries. 


3 SPATIAL PROVENANCE 


In this section, we show how SQL/MM Spatial objects and functions
relate to data provenance, focusing on operations such as
intersection, distance or overlaps, which are not considered in
previous work.

In data provenance, it is necessary to understand how these
operations act on the query result and, taking semiring theory as
an example, how the operators $\oplus$ and $\otimes$ should be used
with these functions. Spatial functions can be divided into several
groups according to the operations we need to perform on the
data. In the following, we divide the spatial operations based
on [26] and, for each type of spatial operation, we investigate how
why-, how- and where-provenance can be used.

To help demonstrate data provenance with spatial objects,
we will use the objects depicted in Figure 1 and Table 1. It is
assumed that each spatial object has an attribute “geom” denoting its
representation (geometry), an attribute “token” that is the token
needed to compute the data provenance, and a column “name”.

Figure 1: An example of spatial objects


name  geom                                                token
A     POLYGON((2.5 2.5,2.5 4.5,4.5 4.5,4.5 2.5,2.5 2.5))  t1
B     POLYGON((4 4,4 6,6 6,6 4,4 4))                      t2

Table 1: Table representing the two objects from Figure 1, with
the name “A”


name  how   why   where
A     (t1)  {t1}  {[A.t1.name]}
B     (t2)  {t2}  {[A.t2.name]}

Table 2: Result provenance for Unary Operations with
boolean result.


name  area  how   why   where
A     4     (t1)  {t1}  {[A.t1.name], []}
B     4     (t2)  {t2}  {[A.t2.name], []}

Table 3: Result provenance for Unary Operations with scalar
result.


3.1 Unary operations 


Unary operations are spatial operations involving only one object 
and can be divided according to the result. 


Operations with boolean result. - These functions return a
boolean result for a test on a particular object. Looking at Figure 1
and Table 1, an example of a query could be to select a column
with the objects’ name (“name”) and a column indicating whether
the geometry of each of the selected objects is valid, resorting to
the function “isvalid” and the polygons’ geometry (“geom”) column.
The three types of provenance of the query’s result are presented
in Table 2.

The result shows that for why- and how-provenance, the prove- 
nance information is formed only by the token of the tuples, and 
the where-provenance also includes the table identifier, the tuple 
token and the respective column. 


Operations with scalar result. - Functions with scalar results
return values such as the area, perimeter or length of an object. To
show the provenance results, we calculate the area of the two squares
and show it together with the name.


name  rotate                                                        how   why   where
A     POLYGON((-2.5 -2.5,-2.5 -4.5,-4.5 -4.5,-4.5 -2.5,-2.5 -2.5))  (t1)  {t1}  {[A.t1.name], []}
B     POLYGON((-4 -4,-4 -6,-6 -6,-6 -4,-4 -4))                      (t2)  {t2}  {[A.t2.name], []}

Table 4: Result provenance for Unary Operations with spatial
result.


Table 3 shows that for the functions with scalar results, how- and
why-provenance are the tuple tokens. For where-provenance,
it shows the set of tuples with the tables for each column in the
output result, although the “area” column has $\emptyset$ as its result.
This is because where-provenance is attribute-based, in contrast to
the other types, which are tuple-based. The new column, created by
the function, does not exist in the database and has no meaning for
where-provenance.
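
The rows of Table 3 can be reproduced with a short sketch. The Python below computes the area of the two squares of Table 1 with the shoelace formula and attaches the three provenance types by hand; the dictionary layout and the helper `area` are assumptions made for illustration and do not reflect how a DBMS extension stores provenance.

```python
def area(ring):
    """Shoelace formula for a closed ring [(x, y), ...] (first == last)."""
    twice = sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )
    return abs(twice) / 2


# Table 1, with the geometries already parsed: token -> (name, ring).
TABLE_A = {
    "t1": ("A", [(2.5, 2.5), (2.5, 4.5), (4.5, 4.5), (4.5, 2.5), (2.5, 2.5)]),
    "t2": ("B", [(4, 4), (4, 6), (6, 6), (6, 4), (4, 4)]),
}

result = [
    {
        "name": name,
        "area": area(ring),
        "how": tok,                               # single token, no operator
        "why": {tok},
        "where": [{("A", tok, "name")}, set()],   # computed column: empty
    }
    for tok, (name, ring) in TABLE_A.items()
]
```

Both rows have an area of 4, and the second entry of "where" is empty because the "area" column is computed rather than copied from the table.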


Operations with spatial result. - Functions with a spatial result
are intended to transform one spatial object into another, and there
are many different operators. One example with the squares in
Figure 1 is the rotation of each polygon by $\pi$ radians:
rotate(geom, pi()).

As the result in Table 4 shows, the how- and why-provenance are
the tuple tokens, and where-provenance only shows provenance for
the name, since the rotation creates a new object that is not part
of the table.
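
The coordinates in Table 4 can be checked with a minimal sketch of a rotation about the origin. The helper `rotate` below is our own illustration, assuming a counter-clockwise rotation about the point (0, 0); the results are rounded to hide floating-point noise from `sin(pi)`.

```python
import math


def rotate(ring, angle):
    """Rotate every vertex counter-clockwise about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c - y * s, x * s + y * c) for x, y in ring]


square_a = [(2.5, 2.5), (2.5, 4.5), (4.5, 4.5), (4.5, 2.5), (2.5, 2.5)]
rotated = [(round(x, 9), round(y, 9)) for x, y in rotate(square_a, math.pi)]

# Provenance of the result row: the geometry is new, so only "name" has a source.
row = {"how": "t1", "why": {"t1"}, "where": [{("A", "t1", "name")}, set()]}
```

A rotation by pi radians maps each vertex (x, y) to (-x, -y), which matches the rotated polygon shown for object A.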


3.2 Binary operations 


These operations always involve two objects and, like the unary
operations, they can be divided according to the same result types.


Operations with boolean result. - Binary operations of this type
are called spatial predicates. They are the basis for spatial
querying (spatial selections and spatial joins) [26]. These operations
have three subdivisions: topological predicates, direction
predicates and metric predicates. The first represent topological
relationships like intersects or contains, among others, and show,
for example, whether a line intersects a point. The second
show the relative position of two objects, for example, whether
one object is south of another. The metric predicates can be used
to compare the distance between objects.

Consider Figure 1 and Table 1. To find out whether the two squares
intersect, one may perform a query that joins the table with itself
by joining polygons with different names that intersect each other.


name  name  how        why      where
A     B     (t1 ⊗ t2)  {t1,t2}  {[A.t1.name, A.t2.name], [A.t1.name, A.t2.name]}
B     A     (t2 ⊗ t1)  {t2,t1}  {[A.t1.name, A.t2.name], [A.t1.name, A.t2.name]}

Table 5: Result provenance for Binary Operations with
boolean result.


Table 5 shows the results of binary operations with boolean
results and demonstrates that, in terms of how-provenance, they
differ from the results of the unary operations with boolean results
(Table 2). Since it involves two objects, a binary operation
needs an operator, and the intersection can only be established with
both objects. Thus, $\otimes$ is used. Why-provenance has one set of
tuples that corresponds to both tuples’ tokens, and where-provenance
shows two tokens with the table for each column. Since it is a self
join, the where-provenance is equal for both columns involved in the
join and for both tuples.
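
The self join behind Table 5 can be sketched as follows. Because both objects of Table 1 are axis-aligned squares, the sketch replaces a general intersects predicate with a bounding-box test; this simplification, the helper `intersects` and the string used for the how-provenance are assumptions made for illustration.

```python
from itertools import permutations

# Table 1 as token -> (name, box), with box = (xmin, ymin, xmax, ymax).
TABLE_A = {"t1": ("A", (2.5, 2.5, 4.5, 4.5)), "t2": ("B", (4, 4, 6, 6))}


def intersects(a, b):
    """True when two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


result = []
for (tok1, (name1, box1)), (tok2, (name2, box2)) in permutations(TABLE_A.items(), 2):
    if name1 != name2 and intersects(box1, box2):
        result.append({
            "names": (name1, name2),
            "how": f"({tok1} \u2297 {tok2})",   # both tuples are used jointly
            "why": {tok1, tok2},
        })
```

The join yields two rows, (A, B) and (B, A), and each how-provenance combines the two tokens with the joint-use operator, as in Table 5.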


Operations with scalar result. - In unary operations, a scalar re- 
sult involves measures for one object. In binary operations, these 
operations will also involve measures between two objects, like the 
function distance. 

To calculate the distance, we need to join the table with itself,
as in the previous example. As the result shows, the distance
between the objects is zero because they intersect each other.


name  name  distance  how        why      where
A     B     0         (t1 ⊗ t2)  {t1,t2}  {[A.t1.name, A.t2.name], [A.t2.name, A.t1.name]}
B     A     0         (t2 ⊗ t1)  {t2,t1}  {[A.t1.name, A.t2.name], [A.t1.name, A.t2.name]}

Table 6: Result provenance for Binary Operations with scalar
result.


The provenance results in Table 6 are similar to the previous, with 
the difference we already saw in unary operations. The “distance” 
column has no where-provenance. 
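
A minimal sketch of the distance query, again assuming axis-aligned boxes instead of general geometries, is given below. The helper `box_distance` and the row layout are hypothetical; the point is that the scalar column is computed, so its where-provenance is empty, while how- and why-provenance still combine both tokens.

```python
import math


def box_distance(a, b):
    """Shortest distance between two axis-aligned boxes (xmin, ymin, xmax, ymax);
    zero when they touch or overlap."""
    dx = max(0.0, b[0] - a[2], a[0] - b[2])
    dy = max(0.0, b[1] - a[3], a[1] - b[3])
    return math.hypot(dx, dy)


TABLE_A = {"t1": ("A", (2.5, 2.5, 4.5, 4.5)), "t2": ("B", (4, 4, 6, 6))}

rows = []
for tok1, (name1, box1) in TABLE_A.items():
    for tok2, (name2, box2) in TABLE_A.items():
        if name1 != name2:
            rows.append({
                "names": (name1, name2),
                "distance": box_distance(box1, box2),
                "how": (tok1, tok2),       # read as tok1 joined with tok2
                "why": {tok1, tok2},
                "where_distance": set(),   # computed column: no source cell
            })
```

Both rows report a distance of zero because the two squares overlap.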


Operations with spatial result. - These operations use two different
objects to create a new one. If one thinks of examples with roads
and maps, these operations can be seen as the basis for map layers,
because they allow the creation of layers over layers with functions
such as union or difference [26]. Examples of these functions are
intersection and union, among others.

We use the intersection as the query to show the data provenance
for this last type of operation. In this query, we join the table with
itself and filter by the square with the name “A” in the first table
and the square with the name “B” in the second. The query result
is the small square where the two squares overlap.

The how- and why-provenance are again the two tokens and,
since we only selected the intersection, the where-provenance is
$\emptyset$.


polygon                                     how        why      where
POLYGON((4.5 4.5,4.5 4,4 4,4 4.5,4.5 4.5))  (t1 ⊗ t2)  {t1,t2}  {[]}

Table 7: Result provenance for Binary Operations with spatial
result.


Therefore, the unary operations have the tuple token as their
provenance result, since they only involve one object. The binary
operations are different, since they can be spatial joins or need a
join to allow the operations. Thus, they need an operator in
how-provenance, and where- and why-provenance have more than one
tuple in each set.

It is important to note that, for where-provenance, every function
that is used in the projection and creates a new column has
$\emptyset$ as its where-provenance, as mentioned above.
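
The intersection query discussed in this subsection can be sketched in the same simplified setting of axis-aligned boxes. The helpers `box_intersection` and `to_wkt` are hypothetical and only cover this special case; a spatial DBMS computes intersections of arbitrary geometries.

```python
def box_intersection(a, b):
    """Intersection of two axis-aligned boxes (xmin, ymin, xmax, ymax),
    or None when they do not overlap."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin > xmax or ymin > ymax:
        return None
    return (xmin, ymin, xmax, ymax)


def to_wkt(box):
    """Render a box as a closed WKT polygon ring."""
    xmin, ymin, xmax, ymax = box
    ring = [(xmax, ymax), (xmax, ymin), (xmin, ymin), (xmin, ymax), (xmax, ymax)]
    return "POLYGON((" + ",".join(f"{x:g} {y:g}" for x, y in ring) + "))"


square_a = ("t1", (2.5, 2.5, 4.5, 4.5))
square_b = ("t2", (4, 4, 6, 6))

row = {
    "polygon": to_wkt(box_intersection(square_a[1], square_b[1])),
    "how": (square_a[0], square_b[0]),   # read as t1 joined with t2
    "why": {square_a[0], square_b[0]},
    "where": [set()],                    # only a computed column was selected
}
```

The resulting polygon is the small square between (4, 4) and (4.5, 4.5), and its where-provenance is empty because the geometry does not exist in the source table.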

Spatial extensions also provide other functions that help to create
or alter objects or to add columns to the tables. However, those
functions behave like the updates or alters in standard SQL. These
functions are not involved in selections, hence they are not part of
data provenance, but there are studies that aim to understand schema
and data changes and how to use provenance to help keep track of
these changes [11].


4 EXPERIMENTAL EVALUATION 


This section describes our test environment, including the dataset
and the solutions used to perform the tests. It then explains the
tests performed and the results obtained, and finishes with a
discussion of those results.


4.1 Environment 


The database chosen for our test environment is PostgreSQL, for
two main reasons. First, it is easy to deal with spatial objects
using the PostGIS extension. Second, one of the solutions we will
present to generate the data provenance information is implemented
on PostgreSQL.

The dataset used is shown in Tables 8 and 9. Table 8, named “B”,
contains only polygons, specifically one triangle and three squares,
as can be seen in Figure 2. Table 9, named “C”, contains one
polygon (a square), two points, and a line, as shown in Figure 3.
Both tables have three columns: the “name” column, which helps to
identify objects in the queries; the “geom” column, which holds the
objects’ geometrical representation; and the “token” column, which
holds the token needed for the data provenance.


name  geom                              token
A     POLYGON((0 0,0 6,6 0,0 0))        t1
B     POLYGON((4 4,4 6,6 6,6 4,4 4))    t2
C     POLYGON((4 0,4 2,6 2,6 0,4 0))    t3
D     POLYGON((6 5,6 7,8 7,8 5,6 5))    t4

Table 8: Table B


name  geom                              token
E     POLYGON((0 0,0 2,2 2,2 0,0 0))    t5
F     POINT(5.5 4.5)                    t6
G     LINESTRING(3 7,5 5)               t7
H     POINT(6 1)                        t8

Table 9: Table C


We experimentally evaluated the use of two data provenance
solutions in order to understand whether the solutions and the
theories behind the provenance types work with spatial functions.

[Figures 2 and 3: Table B and Table C]

                         Solutions
Data provenance types    ProvSQL    Distributed
Why                      ✗          ✓
How                      ✓          ✓
Where                    ✓          ✗
Probabilistic            ✓          ✗

Table 10: Summary of the solutions and the provenance types they
accept


The first solution is ProvSQL [30], a lightweight extension for
PostgreSQL. It not only deals with how- and where-provenance
but also allows probabilistic queries. It also introduces what
the authors call m-semirings to deal with non-monotone queries.

The second solution [25] takes a different approach to dealing
with data provenance. First, it works independently of the database,
and second, it can be used in distributed environments. This
approach has two modules. The first module re-writes the query,
and the second module builds the provenance information based on
the provenance annotations added by the query re-writer. This second
solution will also be applied over the same PostgreSQL database
used with ProvSQL. It gives us the why-provenance, allowing us to
incorporate all three main data provenance types in our tests.

Regarding query syntax, the two solutions use different syntaxes
to include the provenance information in the final query result.
Since ProvSQL is an extension to PostgreSQL, the user needs to
call its data provenance functions in the query, depending on the
type of data provenance to be obtained, and some pre-processing is
also needed. The authors provide a helpful and straightforward
guide to creating the tokens, mapping the tokens, and using the
data provenance functions.

The second solution re-writes the query, adding the data provenance
syntax without user involvement. In contrast with ProvSQL,
this solution does not add functions to the query syntax. Instead,
it adds columns with the tokens separated by different delimiters,
allowing the second module to build the provenance information. This
solution assumes that the tables already have the tokens.

In terms of data provenance syntax in the final result, both
solutions use the same syntax as the definitions of the data
provenance types [4, 6].

           Operations
Result     Unary                    Binary
boolean    ST_IsValid(geom)         ST_Intersects(geom,geom)
scalar     ST_Area(geom)            ST_Distance(geom,geom)
spatial    ST_Rotate(geom,float)    ST_Intersection(geom,geom)

Table 11: Summary of the functions used in the tests, by operation
and result


Distance              how          why        where
11.180339887498949    (t2 ⊗ t8)    {t2,t8}    {[]}
13.416407864998739    (t4 ⊗ t8)    {t4,t8}    {[]}

Table 12: Results of Query 1


As Table 10 shows, we need both solutions to obtain the three
main data provenance types. The distributed solution is a new
approach and, for now, cannot deal with where-provenance or
probabilistic queries; as mentioned by its authors, this is left for
future work. The where-provenance is not tuple-based and needs
different approaches to be built. The probabilistic queries can be
a challenge because, looking at the tools used in [30] (c2d, d4,
dsharp, weightmc, graph-easy), it is necessary to understand not
only how to apply them in a distributed environment but also
whether all the involved structures support them.


4.2 Experimental work and results 


The tests performed over our datasets were designed to cover the
different types of unary and binary operations. Table 11 summarizes
the operations used for each type.

The first experiment uses Query 1 (Listing 1). It includes a unary
operation with a spatial result and a binary operation with a scalar
result. The query rotates the squares “B” and “D” from Table 8
and calculates the distance from each rotated square to the point
“H” in Table 9.


Listing 1: A query with rotation and distance


-- Rotate squares B and D by PI radians, then measure the distance to point H
SELECT ST_Distance(NewB.geom, C.geom) AS Distance
FROM
    (
        SELECT ST_Rotate(geom, PI()) AS geom
        FROM B
        WHERE name = 'B' OR name = 'D'
    ) NewB, C
WHERE C.name = 'H'


Table 12 shows that the how-provenance has two tuples and the ⊗
operator, meaning that we need both tuples to obtain a row. The
why-provenance has one witness made up of two tuples, and the
where-provenance is empty since the distance is a calculated column.
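The distances in Table 12 can be cross-checked without a database. The following is a minimal sketch in pure Python, not the PostGIS implementation: it assumes a rotation about the origin (the default of ST_Rotate) and relies on the fact that point “H” lies outside both rotated squares, so the distance to the polygon equals the distance to its boundary. The helper names are introduced here for illustration.

```python
import math


def rotate(point, angle):
    """Rotate a point counter-clockwise about the origin."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = 0.0 if length_sq == 0 else max(
        0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_ring_distance(p, ring):
    """Distance from p to a closed ring, assuming p is outside the polygon."""
    return min(point_segment_distance(p, a, b) for a, b in zip(ring, ring[1:]))


squares = {
    "B": [(4, 4), (4, 6), (6, 6), (6, 4), (4, 4)],  # token t2
    "D": [(6, 5), (6, 7), (8, 7), (8, 5), (6, 5)],  # token t4
}
h = (6, 1)  # point "H", token t8

distances = {}
for name, ring in squares.items():
    rotated = [rotate(vertex, math.pi) for vertex in ring]
    distances[name] = point_ring_distance(h, rotated)
    print(name, round(distances[name], 6))
```

Up to floating-point rounding, the two values agree with the Distance column of Table 12 (the square roots of 125 and 180).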

The next test uses a query with a unary operator with a boolean
result and a binary operator with the same type of result. Query 2
joins the two tables using the binary operator ST_Intersects, and
the output shows whether each of the two intersecting objects is
valid.

name  bv    name  cv    how          why         where
A     true  E     true  (t1 ⊗ t5)    {t1, t5}    {[B:t1:1], [], [C:t5:1], []}
B     true  F     true  (t2 ⊗ t6)    {t2, t6}    {[B:t2:1], [], [C:t6:1], []}
B     true  G     true  (t2 ⊗ t7)    {t2, t7}    {[B:t2:1], [], [C:t7:1], []}
C     true  H     true  (t3 ⊗ t8)    {t3, t8}    {[B:t3:1], [], [C:t8:1], []}

Table 13: Results of Query 2


Area  how          why        where
4     (t1 ⊗ t5)    {t1,t5}    {[]}

Table 14: Results of Query 3


Listing 2: A query with intersects and IsValid


-- Spatial join of B and C; report the validity of each intersecting pair
SELECT B.name, ST_IsValid(B.geom) AS BV,
       C.name, ST_IsValid(C.geom) AS CV
FROM B, C
WHERE ST_Intersects(B.geom, C.geom)


Table 13 shows that the two solutions can deal with spatial
joins and present the provenance result. In terms of how- and
why-provenance, the result is similar to the previous one. However,
the where-provenance differs: it has results for two columns, in this
case the “name” columns, while the other two columns are ∅. In
ProvSQL, each where-provenance entry lists the table first, then the
token, and finally the index of the column in the table. The
“name” column is the first column in both tables.
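To make the notation concrete, the following sketch parses one where-provenance cell as printed in Table 13. It is an assumption-laden illustration: the textual format is taken from the table in this paper rather than from the ProvSQL API, the `Source` type and `parse_where` function are hypothetical names, the comma-separated handling of several entries inside one bracket is a guess, and the output column names follow the aliases of Listing 2.

```python
import re
from typing import List, NamedTuple


class Source(NamedTuple):
    table: str
    token: str
    column: int  # 1-based index of the column in the source table


def parse_where(cell: str) -> List[List[Source]]:
    """Parse a where-provenance cell into one list of sources per output column."""
    columns = []
    # Each bracketed group describes one output column; an empty group means ∅.
    for group in re.findall(r"\[([^\]]*)\]", cell):
        sources = []
        for entry in filter(None, (part.strip() for part in group.split(","))):
            table, token, column = entry.split(":")
            sources.append(Source(table, token, int(column)))
        columns.append(sources)
    return columns


row = parse_where("{[B:t1:1], [], [C:t5:1], []}")
for output_column, sources in zip(["name", "bv", "name", "cv"], row):
    print(output_column, sources or "∅")
```

Only the two copied “name” columns carry a source cell; the two computed validity columns come back empty.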

The next experiment is Query 3, which uses a unary operation with
a scalar result and a binary operation with a spatial result. It
calculates the area of the new object resulting from the intersection
between polygon “A” in Table 8 and polygon “E” in Table 9.


Listing 3: A query with the area of a new object


-- Area of the intersection between polygon A (table B) and polygon E (table C)
SELECT ST_Area(ST_Intersection(B.geom, C.geom))
FROM B, C
WHERE B.name = 'A'
  AND C.name = 'E'


In Section 3.1, we demonstrated that when the area function is used
alone, the provenance result has just one tuple, i.e., the row
tuple. In Table 14, however, the how- and why-provenance have two
tuples, which in the how-provenance are combined by ⊗. The binary
(“Intersection”) and the unary (“Area”) functions, if used
independently of each other, would give different results. The
how- and why-provenance for “Area” would be only the tuple token,
and for the binary function it would be the combination of two
tokens, because we are joining (intersecting) two polygons. The
result in Table 14 demonstrates that when both are applied together,
the provenance of the join prevails over that of the unary function.
Since the column presented is created by the functions, the
where-provenance does not apply.
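The value in Table 14 can be reproduced with a small stand-in for ST_Intersection and ST_Area. The sketch below uses Sutherland-Hodgman clipping and the shoelace formula; this is a technique swapped in here for illustration (it requires a convex clip polygon) and is not how PostGIS computes intersections. The function names are introduced here.

```python
def signed_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))


def clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex `clipper` polygon."""
    if signed_area(clipper) < 0:  # normalise to counter-clockwise order
        clipper = clipper[::-1]
    output = list(subject)
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        def side(p):
            # >= 0 when p lies on or to the left of the directed edge a -> b
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        def crossing(p, q):
            # Point where segment p-q crosses the line through a and b
            sp, sq = side(p), side(q)
            t = sp / (sp - sq)
            return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            if side(q) >= 0:
                if side(p) < 0:
                    output.append(crossing(p, q))
                output.append(q)
            elif side(p) >= 0:
                output.append(crossing(p, q))
        if not output:
            break
    return output


triangle_a = [(0, 0), (0, 6), (6, 0)]        # polygon "A", token t1
square_e = [(0, 0), (0, 2), (2, 2), (2, 0)]  # polygon "E", token t5

overlap = clip(square_e, triangle_a)
area = abs(signed_area(overlap))
how = "(t1 ⊗ t5)"  # the join required by the binary function sets the provenance
print(area, how)   # 4.0 (t1 ⊗ t5)
```

Square “E” lies entirely inside triangle “A”, so the intersection is the square itself and its area is 4, while the provenance still records both input tuples.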


4.3 Results discussion


The results of our tests demonstrate that the solutions work with
spatial objects and spatial functions independently of their type
(unary or binary). Consequently, they demonstrate that data provenance
theories can be applied to data types other than the standard ones.

However, some spatial functions have variants that take more than
two values; their input can also be an array or an aggregation. As
aggregations, these functions behave like standard aggregate
functions, such as the average of values (AVG). One of these
operators is the union. Unfortunately, neither of the solutions
currently supports this type of aggregation.

As stated in [29], aggregation operators need a paradigm change
from semirings to semimodules. The study in [2] is an example of
how semimodules can be used to represent that kind of operator.
However, there is still no practical solution that applies
semimodules.


5 CONCLUSION 


In this work, we addressed the topics of data provenance and spatial 
data. 

We presented the basic concepts about data provenance and spa- 
tial databases and a brief literature review showing that combining 
these two subjects is still an open research topic. 

Next, we presented a study on how to compute and interpret data 
provenance results based on a classification of spatial operations 
and three main types of data provenance (how-, why- and where- 
provenance). The results show that these tools, which were devel- 
oped for managing provenance in general-purpose (non-spatial) 
databases, are also capable of handling different types of spatial 
operations, namely predicates, topological operations, and spatial 
joins, among others. The exceptions are aggregations and set oper- 
ations, for instance, to compute the union of spatial objects. 

In future work, we intend to perform more tests with data
provenance, this time with spatio-temporal data, to understand how
temporal queries affect the data provenance and whether we can
collect the data provenance of operations as specific as the
temporal ones.


ABSTRACT


This paper investigates the similarity between the Indus Valley 
script and the Kannada, Malayalam, Tamil, and Telugu scripts that 
are used to write Dravidian languages. The closeness of these scripts 
is determined by applying a feature analysis of each sign of these 
scripts and creating similarity matrices that describe the similarity 
of any pair of signs from two different scripts. The feature list that 
we use for the analysis of these Dravidian language-related scripts 
includes six new features beyond the thirteen features that were 
used for the study of Minoan Linear A and related scripts by Revesz. 
These new features are the check mark, short vertical line, dot,
upper curve, parallel curves, and horizontal line features.


1 INTRODUCTION


It is widely believed that the first human civilization flourished
somewhere near the present-day upper eastern part of Africa and
that all of humanity at that time spoke a single language, called a
protolanguage, which is the origin of all the languages spoken in
today’s world [13]. The protolanguage spread and diversified
together with human populations as humans started to leave the
Sahara when temperatures started soaring and the desertification
of the Sahara began. The desertification prompted people to split
into small groups and to travel to different places in search of
food, shelter, and viable climatic conditions. This process changed
people’s way of life along with their environmental needs,
requirements, and way of communicating. Although many scientists
and researchers believe in the divergence of languages from a
protolanguage, this hypothesis is still controversial. Determining
how similar two languages are is a complex problem. The following
are three major methods that help us determine how closely
languages are related.


1.1 Human migrations 


In this method, we track people’s migrations throughout history
and observe how these migrations affected languages. Generally,
scientists relate linguistics to molecular biology. By tracking
the mitochondrial DNA present in human cells, one can trace back
people’s ancestors, and research suggests this process also works
well for finding the path of a language. However, we cannot rely
completely on this method, since the further back in time we go,
the less evidence we have, and we have no accurate metrics on which
to base our assumptions.


1.2 Similar sounding words 


Many languages are derived from others and contain the same words
conveying similar meanings. However, there is a very high
probability that a word with the same sound has a different
meaning. These are known as homophones. For example, the word
filter in English denotes a substance that is used to separate
different things, but the same word means ‘poison’ in French. Such
words are false cognates. Hence, simply looking for similar-sounding
words is a faulty method.


1.3 Feature analysis


In this approach, we find the similarity between two languages
by observing the similarity between their scripts and their regular
changes. This is done by developing features that describe all the
letters in the scripts and building a feature evaluation table for
each script. Once we have the feature evaluation tables for at
least two languages, we can create the similarity matrix to check
how closely the two scripts are related. We follow this method in
our implementation.


2 BACKGROUND 


The Indus Valley Script is an ancient script developed by the Indus
Valley civilization, which existed c. 3500-1900 BCE. The Indus Valley
Civilization was first identified at Harappa and Mohenjo-Daro in
1921 and 1922, respectively [7]. The first publication of a seal
with Harappan symbols appeared in 1875 in the drawings of
Sir Alexander Cunningham. Mahadevan [5] proposed a list of signs
with 417 distinct symbols in 1977. Later, the Corpus of Indus Seals
and Inscriptions (CISI) introduced 386 different symbols [4, 6, 7].

Figure 1: Feature analysis of Linear A signs according to Revesz [9].

The Indus Valley Civilization originated during the same period
as the Sumerian civilization. The Indus Valley and its river
tributaries provided basic food and transportation to the people,
as the Euphrates and Tigris Rivers did in Mesopotamia. The Indus
Valley civilization had brick homes, baths, and forts, and used
copper and bronze to make tools and weaponry. Seals showing a mix
of symbols were attached to trade goods and used for commerce. The
most important settlements were Mohenjo-Daro and Harappa, which
contained about 35,000 people. Much research has shown evidence of
trade between the Indus Valley and Mesopotamia [12].

The Dravidian language family comprises about thirty languages
that are common today in Southern India, including Kannada,
Malayalam, Tamil, and Telugu [14]. Daggumati and Revesz [1-3]
suggest the possibility of a migration of proto-Dravidian people
to the Indus Valley from Mesopotamia, because Sumerian pictograms
are the most similar to Indus Valley Script signs among a set of
ancient scripts. In addition, Proto-Dravidian piru and Mesopotamian
pirus both mean ‘elephant’ [12]. The prevalence of Dravidian
cognates in the Rig-Veda suggests that Dravidian and Aryan speakers
had merged into one language in the large Indo-Gangetic Plain by
the time of its composition, while independent Dravidian groups
had moved to the boundary of the Indo-Aryan area. The history
of Dravidian language evolution is hard to study because the
earliest Tamil inscriptions, which were found in the Madurai and
Tirunelveli districts of Tamil Nadu, date only from the 2nd century
BCE. Perhaps the decipherment of the Indus Valley script could
shed more light on the evolution of Dravidian languages.


3 A NEW FEATURE ANALYSIS METHOD


In this paper, we follow the third method of finding similarity among 
scripts, that is, by using feature analysis and similarity matrices. 


3.1 Feature Analysis 


The concept of developing features and then presenting the
results using similarity matrices was initially suggested by Revesz
[8, 9].

Revesz [9] found thirteen features that commonly occur in various
scripts. These thirteen features can distinguish all the signs in
various ancient scripts. For example, Figure 1 shows a feature
analysis of the Minoan Linear A script, where each feature has a
symbol (contains a curved line, contains an enclosed region, has a
slanted straight line, etc.). Features that are present are marked
in red, and features that are absent are marked in black.
Given feature tables for two different scripts, a similarity matrix
can be generated from them, such as for the Linear A script and
the Carian alphabet [2]. In general, a similarity matrix helps
us to visualize at a high level how close two scripts are. The
similarity matrix is created by calculating the absolute difference
between the features of a particular letter in one evaluation table
and the features of a letter in the other evaluation table, summed
over all features. This is done for every pair of letters from the
two evaluation tables. The output of this process is a distance
matrix. We then subtract every element of the distance matrix from
the total number of features, thirteen in this case, to get the
similarity matrix.
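The construction described above can be written compactly as follows. This is a sketch of one reading of the text; the symbols a, b, F, d, and s are notation introduced here for illustration and are not taken from [9].

```latex
% a, b in {0,1}^F are the feature vectors of one sign from each script,
% where an entry is 1 if the feature is present and 0 if it is absent.
\[
  d(a, b) = \sum_{k=1}^{F} \lvert a_k - b_k \rvert ,
  \qquad
  s(a, b) = F - d(a, b) ,
  \qquad F = 13 .
\]
% The distance matrix collects d(a_i, b_j) and the similarity matrix
% collects s(a_i, b_j) for every pair of signs (i, j).
```

Under this reading, two signs with identical features reach the maximum similarity F, and every differing feature lowers the similarity by one.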


3.2 Our approach 


We considered the Indus Valley Script and the scripts that
are used to write the Dravidian languages Kannada, Malayalam,
Tamil, and Telugu. We applied feature analysis to these scripts
and tried to find similarities among them. We considered 25 of the
most common letters from each script. Unlike Western scripts, the
Dravidian scripts are more cursive, so we needed to add some extra
features to the thirteen features that were proposed in [9]. The new
features capture details of the cursive Dravidian scripts and thus
improve the accuracy of describing and comparing the script signs.
Figure 2 shows the additional features that we introduced for the
sake of an improved analysis.

In Figure 2, the check mark is a predominant feature of the Telugu
script and plays a major role in changing the pronunciation of the
script signs. In the Kannada, Tamil, and Telugu scripts, the
presence of a short vertical line, dot, or upper curve gives a sign
a very different meaning compared to its absence. For example, some
Telugu script signs are differentiated by a single dot mark alone.
The horizontal line in the Malayalam script alone distinguishes more
than two signs. Finally, we included parallel curves because these
Dravidian scripts consist more of cursive strokes than of
straight-line strokes.

Figure 2: We introduce the following new features from top
to bottom: check mark, short vertical line, dot, upper curve,
parallel curves, and horizontal line.

After developing these feature analysis tables, we needed to
create similarity matrices between any two of the considered
scripts. A similarity matrix is an N x N matrix, where N is the
number of letters considered in the analysis. Hence, each similarity
matrix in our context is a 25 x 25 matrix with 625 entries.
Calculating all these entries manually is very time-consuming and
prone to mistakes. Hence, we decided to develop a computer program
that calculates all the values accurately and efficiently. Below we
describe how we turned the values in the feature evaluation table
into inputs for the similarity matrix, together with how we
developed the logic for the matrix calculation.

First, we treat all the features of a particular sign as a single
vector. The features that are marked red (the features that are
present in the letter) are encoded as 1s, and the remaining
black-marked features (the features that are not present in that
letter) are encoded as 0s. We can therefore extract a total of 25
vectors (from the 25 signs) from one feature evaluation table. Each
of these 25 vectors is compared separately with all 25 feature
vectors of the second feature evaluation table. Figure 3 shows the
feature analysis matrix for the Malayalam script. Figure 4 shows the
feature analysis matrix for the Telugu script.
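The encoding step can be sketched as follows. This is an illustration only: the `FEATURES` list is shortened to the six new features named in this paper, `to_vector` is a hypothetical helper, and the example sign is made up rather than taken from the feature tables.

```python
# Shortened feature list for illustration (the six new features of Figure 2).
FEATURES = ["check mark", "short vertical line", "dot",
            "upper curve", "parallel curves", "horizontal line"]


def to_vector(present, features=FEATURES):
    """Return a 0/1 vector: 1 where a feature is marked present (red), else 0."""
    present = set(present)
    unknown = present - set(features)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return [1 if name in present else 0 for name in features]


# Hypothetical sign that has a dot and an upper curve.
sign = to_vector({"dot", "upper curve"})
print(sign)  # [0, 0, 1, 1, 0, 0]
```

Stacking one such vector per sign gives the feature matrix of a script, with one row per sign and one column per feature.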

After forming the two feature matrices, we need to transpose one
of them to facilitate the matrix operations. Here we have two
25 x 16 matrices, and since forming a similarity matrix involves
multiplication-like operations, we would encounter a dimension
mismatch error if we did not transpose one of the two feature
vector matrices.

With everything set up to create the similarity matrix, the
question is what exactly the main operation should be. Before
discussing that, let us review how a similarity matrix is formed in
the traditional way. We calculate the absolute difference between
two features in their respective positions and subtract the sum of
these differences from the total number of features to get the
similarity value. To do this, we initially tried three methods. The
first uses the dot product. The dot product is related to the angle
between two vectors (A·B = |A||B|cos(θ)), where θ is the angle that
measures how much the two vectors deviate from one another.
When we implemented this model, unlike the real dot product, the
machine performed a simple matrix multiplication (a weighted sum of
vectors), due to which we tend to lose some of the feature values.

In the second method, we tried applying the XOR operation to the
feature vectors, which returns 1 only when the corresponding
entries differ (0 and 1, or 1 and 0), which is exactly what we
expect the result to be. But again, we encountered trouble during
its implementation. Applying the XOR operation to the matrices
gives the bitwise XOR results rather than the element-wise results.
Due to this, the final matrix has a dimension of 25 x 16, unlike the
25 x 25 square matrix that we expect.

The third method is a hybrid of the first two. It performs an
elementwise XOR weighted sum on the vector matrices, giving us the
absolute difference of each feature vector with all feature vectors
in the other vector matrix and vice versa. The result is a 25 x 25
matrix with the correct values. This matrix is a distance matrix,
and in order to get the similarity matrix we must subtract every
entry in the distance matrix from 16, which is the total number of
features in our problem domain. High values in the similarity
matrix represent strong closeness, and low values represent weak
closeness, between the corresponding signs.
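The third method can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the authors' program (which the paper says passes values through NumPy arrays): the function names are introduced here, and the toy data use 2 signs and 4 features per script instead of 25 signs and 16 features. The plain dot product is included only to show that it counts shared present features and ignores features that are absent in both signs.

```python
def distance_matrix(first, second):
    """Entry [i][j] counts the features on which sign i and sign j differ."""
    return [[sum(x ^ y for x, y in zip(a, b)) for b in second] for a in first]


def similarity_matrix(first, second):
    """Similarity is the total number of features minus the distance."""
    total = len(first[0])
    return [[total - d for d in row] for row in distance_matrix(first, second)]


def dot_matrix(first, second):
    """Plain matrix product of `first` with the transpose of `second`."""
    return [[sum(x * y for x, y in zip(a, b)) for b in second] for a in first]


# Toy feature matrices: one row per sign, one column per feature.
script_a = [[1, 0, 1, 0], [0, 0, 0, 0]]
script_b = [[1, 0, 1, 0], [0, 0, 0, 1]]

print(distance_matrix(script_a, script_b))    # [[0, 3], [2, 1]]
print(similarity_matrix(script_a, script_b))  # [[4, 1], [2, 3]]
print(dot_matrix(script_a, script_b))         # [[2, 0], [0, 0]]
```

In the toy data, the second sign of each script differs in only one feature, which the similarity value of 3 reflects, whereas the dot product of the same pair is 0.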

Finally, we presented these similarity matrices using heat maps 
for better visualization. We used a color gradient from bright blue 
to dark red to represent the values inside the matrix where red is 
assigned for high values and blue for low values. 


4 DISCUSSION 


In this section, we present the feature analysis for the Malayalam
script, screenshots of our process showing the different matrices
discussed earlier, and finally some output heat maps. The heat
maps are presented for the Telugu-Malayalam and Kannada-Telugu
script pairs, each using the total of sixteen features in the
feature evaluation table.

The upper left of Figure 5 shows all 16 features for the 25 signs
of the feature evaluation table in vector notation, making it a
25 x 16 matrix, where 25 is the number of signs and 16 is the number
of features. Since we need two vector matrices to create a
similarity matrix, we transpose one of them (upper right of Figure
5) to facilitate the elementwise XOR multiplication. Neither the
dot product (25 x 25) of these two matrices nor the XOR matrix
leads us to the similarity matrix, because they perform a simple
matrix multiplication and a bitwise XOR (25 x 16), respectively. To
create a similarity matrix, we need to perform the elementwise XOR
multiplication (25 x 25) of the matrices, which calculates the
weighted sum of absolute differences between any two feature
vectors, as shown in the lower left of Figure 5. This is the
definition of a distance matrix. The similarity matrix is found
by subtracting every element of the 25 x 25 distance matrix from
the total number of features, as shown in the lower right of
Figure 5.

From a similarity matrix, it is easy to generate a heat map. For
example, the Malayalam-Telugu heat map is shown in Figure 6, and
the Kannada-Telugu heat map is shown in Figure 7. In the
Kannada-Telugu heat map, we can see a highest value of 16 and a
lowest value of 8, which shows that there are highly similar signs
as well as many signs with low similarity. The heat map is dominated
by red rather than blue, which shows that there is a lot of
similarity between the two scripts. In the Malayalam-Telugu heat
map, by contrast, there are fewer highly matched signs with a value
of 16, and there are many blue entries, with a lowest value of 9.
This shows that these two scripts differ considerably more than the
pair in the other heat map.

Figure 4: Feature analysis of the Telugu script.

Figure 5: Telugu feature matrix (upper left), transpose of the Malayalam feature matrix (upper right), distance matrix (lower
left), and a Malayalam-Telugu similarity matrix (lower right).


5 CONCLUSION AND FUTURE WORK 


The Dravidian languages, which include Telugu, Tamil, Kannada,
and Malayalam, are generally known as distinct cousins and are
relatively closely related when compared to the Indus Valley Script.
The Indus Valley script remains undeciphered to this day, but many
different kinds of symbols and seals have been extracted recently.
Among the Dravidian languages, Telugu and Kannada seem closely
related. Though some of the signs in the Tamil script contain a
straight-line stroke, most of its other signs and the signs in the
other three Dravidian scripts are cursive. This project helps in
finding the similarity between the scripts that are expected to
be derived from the undeciphered scripts and thereby in tracing
the evolution of languages. Our goal is to ease the exhaustive
calculations needed to find the similarity matrix between two
scripts during comparison. The project is highly scalable. It can
be extended by passing the feature vector values directly from
the created vector table rather than passing them through NumPy
arrays. The process can also be flexibly applied to words in order
to construct an evolutionary tree as future work. In addition,
feature analysis can be extended from script analysis to art motif
analysis [10] and higher-level textual analysis [11].

Figure 6: Heat map for Malayalam and Telugu scripts.

Figure 7: Heat map for Kannada and Telugu scripts.


ABSTRACT


Query optimization is a challenging process of DBMSs. When tack- 
ling query optimization in the cloud, there exists a simultaneous 
need of providing an optimal physical query execution plan, as well 
as an optimal resource configuration among available ones. Cloud 
computing features like resource elasticity and pricing make the 
process of finding this optimal query plan a multi-objective problem, 
with the monetary cost being an equally important factor to query 
execution time. Apache Spark is a popular choice for managing big 
data in the cloud. However, query optimization in its SQL module 
(Spark SQL) involves a number of limitations due to the rule-based 
nature of its optimizer, Catalyst. We propose a multi-objective cost 
model for the extension of the query optimizer of Apache Spark, 
aiming to minimize both objectives of query execution time and 
monetary cost, as well as a methodology for exploring the space 
of Pareto-optimal query plans and selecting one. The cost model is 
implemented and tuned, and an experimental study is conducted 
to validate its accuracy. 


CCS CONCEPTS 


• Information systems → Query optimization.


KEYWORDS 


query optimization, cost model, cloud computing, multi-objective
optimization, Apache Spark, Catalyst optimizer


1 INTRODUCTION


Query optimization is the most challenging step of query
processing. A query optimizer can be either rule-based, using
heuristics to convert the logical query plan to a physical one, or
cost-based, using cost functions to compare alternative query plans
and return the optimal one according to its estimations. The
architecture of a cost-based query optimizer consists of three key
stages that determine the quality of its predictions [9]:
cardinality estimation, cost modeling, and plan enumeration.

The majority of works on query optimization aim to minimize
query execution time on fixed hardware, which is a valid assumption
in the on-premises world. Query processing in the cloud, however,
presents an extra challenge, as alternative hardware instances
are available. Depending on the resource configuration that is used,
the execution might be completely different. Decision making
involves deciding on the type of the cluster, the number of
instances that will be used, their type, and their characteristics
(e.g., RAM size). In such a scenario, a query optimizer should be
able to pick both an optimal physical query execution plan and a
resource configuration among the available hardware instances, thus
bridging the gap between query and resource optimization [18].

In order to achieve this, optimizer cost models should be
hardware-agnostic, able to model the behaviour of a query plan in
different clusters and systems. A hardware-agnostic cost model could
lead to lower costs as well as better resource efficiency [14].

Query optimization is usually associated with minimizing query 
execution time. The performance of a query, however, can be evalu- 
ated against additional objectives. Features of the cloud, such as resource 
elasticity and pricing, increase the number of objectives that can be 
simultaneously optimized in a cloud setting [5]. Adding instances to achieve 
maximum parallelization during query execution will in general 
lead to lower execution times, but may also lead to much higher 
monetary costs, as well as increased energy consumption. Mone- 
tary cost is one of the most prevalent query optimization objectives 
in the cloud [8, 10, 15]. Energy consumption has also been consid- 
ered lately [13], as cloud providers seek to reduce energy costs. 
Other objectives that can be considered in multi-objective query 
optimization are result precision [16], or data security. 

Query optimization in big data systems, which are usually hosted 
in the cloud, is particularly challenging. As a result, it is necessary 
that the estimation components of query optimizers (cardinality 
estimator, cost model) are accurate. One of the most popular big 
data processing frameworks is Apache Spark, which is widely used 
in research and industry. However, the query optimizer of Spark’s 
SQL-based component, Spark SQL, has a limited cost model. 

In this work, we propose a multi-objective cost model for Spark 
SQL, for the objectives of query execution time and monetary cost. 
For the time objective, we adopt an existing single objective cost 
model [4] for Spark SQL, which shows promising accuracy. We also 
conduct a detailed experimental study to validate it. For the money 
objective, we introduce a formula in order to estimate the monetary 
cost of a query, based on real cloud prices. 
The cost model receives a query and a set of Spark application 
configurations as input, and returns the optimal query plan for 
each configuration, resulting in a Pareto front. The returned plans 
present different tradeoffs for the two objectives, and the user can 
either select one or assign preference weights to the two objectives, 
and be provided with a single query plan that best meets them. 

We also conduct an experimental study on a private cloud envi- 
ronment to validate the cost model accuracy and optimality for a 
broadly adopted architecture, consisting of Spark, Yarn and HDFS. 

Overall, the contributions of this work are the following: 


• proposal of a multi-objective cost model for Spark SQL 

• introduction of a formula for query monetary cost estimation 

• reimplementation and validation of an existing cost model for query 
execution time estimation 

• a detailed experimental study on a real private cloud environment 

• a user-interactive method for exploring the space of alternative 
query plans and choosing an optimal one 


The rest of this paper is organized as follows. Related work is 
discussed in Section 2. Section 3 describes the cost model that we 
implemented. The experimental evaluation and its results follow in 
Section 4, while Section 5 concludes the paper. 


2 RELATED WORK 


Optimization in Spark. The Spark SQL query optimizer, Cata- 
lyst [3], is an extensible optimizer where new rules can be added. 
However, it is not ideal for cost-based query optimization. It only 
uses a limited cost model, being unable to provide analytical es- 
timations for the execution time of a query plan. A number of 
research works have focused on improving specific limitations of 
the Catalyst optimizer [17]. 

Although Spark is highly configurable, manual tuning is 
time-consuming and complex due to the high-dimensional configura- 
tion space. Many works provide frameworks for tuning Spark 
applications [12, 15], in most cases with learned methods. The pro- 
posed cost model can be useful in this perspective too, as apart 
from producing optimal query plans, it can also be used for tuning 
and comparing different application configurations. 

Multi-objective query optimization. Karampaglis et al. [6] 
proposed a bi-objective query cost model suitable for query opti- 
mization over a multi-cloud environment. It successfully provides 
estimates of both the expected execution time and monetary cost. 

A number of works have considered multi-objective query op- 
timization in the cloud. Kllapi et al. [7] proposed a technique to 
optimize dataflow scheduling on a set of containers and form one 
schedule best meeting user constraints. Their work can also be 
used for query optimization, when the execution of a query can 
happen over multiple containers. Multi-objective parametric query 
optimization [16] takes a different approach to query optimization, 
which happens before runtime with the use of an exhaustive DP 
algorithm, and models queries as functions of parameters. 


3 COST MODEL AND IMPLEMENTATION 


In this work, we propose a cost model for cost-based multi-objective 
query optimization in Apache Spark. For the objective of query 
execution time, we adopt a proposed cost model for Spark SQL 
[4], which we also experimentally evaluated. For the objective of 
monetary cost, we introduced a cost estimation formula for a given 
query plan. By combining them, we tackle query optimization as 
a multi-objective optimization (MOO) problem and use different 
methods to explore the space of Pareto-optimal query plans. 


Figure 1: System architecture 


                                                  Catalyst cost model   Proposed cost model 
Query types                                       ALL                   GPSJ 
Cost based join selection                         YES                   YES 
Tables and Columns statistics                     YES                   YES 
Considers cluster topology                        NO                    YES 
Based on system disk access time                  NO                    YES 
Takes into account network speed                  NO                    YES 
Analytic estimation of query execution time (QET) NO                    YES 


Table 1: Comparison of Catalyst and proposed cost model 


3.1 System Architecture 


Figure 1 shows the system upon which the cost model operates. 
The storage layer includes a number of datanodes inside an HDFS 
filesystem. Data is processed in Spark, and Yarn is the resource 
negotiator between HDFS and Spark. For data management, data 
is stored in Apache Hive tables and accessed through Spark SQL 
queries. As for the optimizer, an extended version of Catalyst is 
envisioned, operating cost-based by using the proposed cost model. 
We implemented the cost model outside Spark and used it manually. 


3.2 Cost Model Preliminaries 


The proposed single-objective Spark SQL cost model [4] can provide 
more than Catalyst when it comes to estimating query execution, as 
highlighted by Table 1. It is based on disk access time and network 
speed, as disk and network performance is critical in one-pass 
workloads, like Spark SQL queries. It is a reconfigurable model, 
that can be tuned for any (homogeneous) cluster and system. It 
also performs traditional SQL optimizations by collecting table and 
column statistics. It is aware of the cluster topology, and takes 
into account Spark application parameters that influence query 
execution time. The Spark application parameters considered are 
the number of Spark executors, and the number of executor cores. 

One of its limitations is that it covers the class of Generalized 
Projection, Selection, and Join (GPSJ) queries, which are a subset of 
SQL queries. This means that specific SQL operators such as 
UNION ALL or OUTER JOIN are not supported by the cost model. 
In addition, its use is limited to homogeneous clusters. 

The cost model is capable of analytically estimating the execu- 
tion time of the five essential RDD transformations that occur in 
Spark SQL GPSJ queries: (1) full table scan, (2) full table scan and 
broadcast, (3) shuffle hash join, (4) broadcast hash join, and (5) group by. 
Each of the transformations is modeled as a function in the cost 
model code. Precisely modeling these operations is challenging, 
so the cost model focuses on a set of basic building blocks that determine 
the cost of transformations and actions, and for which it provides cost 
estimates (Read, Write, Shuffle Read, Broadcast). Each one of these 
functions receives a data table set as an input (or two, in case of 
a join operator), as well as the table’s cardinality, size, partitions 
and any filtering predicates. It returns the estimated time needed to 
execute the transformation, the columns returned, the cardinality 
and the adjusted size of the table set. The estimated execution time 
for a query is obtained by summing up the time needed to execute 
each RDD transformation forming the query physical plan. 


3.3 Bi-objective cost model 


In the Spark-Yarn-HDFS architecture, query execution involves two 
parts. A user submits a query, and then specifies some parameters to 
configure the Spark application. Spark application tuning, although 
often done empirically, is a complex decision to make, as Spark 
has a considerable number of parameters that can be configured. 
One of the most critical parameters is the number of executors 
that will be allocated for an application. Each Spark executor runs 
within a Yarn container. Yarn containers are provided by Yarn on 
demand at the start of each Spark Application. Each one hosts a 
Spark executor as well as a number of cores that are assigned to it. 
They are deallocated when the Spark application completes. 

The second part of our cost model involves the prediction of 
the monetary cost of executing a query. In a public IaaS cloud 
platform, execution of a Spark SQL query requires renting a number 
of computing instances to host the Spark executors, using them 
during the runtime of a query and releasing them when execution is 
completed. As a result, we make monetary cost estimations with 
the following formula: 


$$\mathit{cost} = ci(\#\mathit{executors}) \cdot \mathit{runtime} \cdot \frac{\mathit{hcost}\,(\$/\text{hour})}{3600} \qquad (1)$$ 


ci represents the number of computing instances rented as a 
function of the number of executors and hcost is the hourly cost 
of using a single computing instance, which we divide by 3600 to 
convert it to a per-second rate. The formula assumes per-second billing. 

To define the ci function, we need to assume a Spark application 
deployment method. In our work, the Spark application deployment 
was spread out: we assigned each executor to a different com- 
puting instance, in order to achieve maximum parallelization. As a 
result, ci(#executors) = #executors. 

In our experimental part, we used prices from Amazon EC2 
instances. For an example of a Spark application with 4 executors 
and 4 executor cores, we rent 4 homogeneous computing instances 
with 4vCPUs each, for the time needed for the query to execute. If 
we use a1.xlarge instances ($0.102 hourly cost) and the query takes 
20 minutes to complete, its cost is estimated to be about $0.136. 
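
The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch of Equation (1), assuming one computing instance per executor and per-second billing as stated in the text; the function and parameter names are illustrative and are not taken from the authors' implementation.

```python
def instances_for(executors: int) -> int:
    """ci(#executors): one computing instance per executor (deployment assumed in the text)."""
    return executors


def monetary_cost(executors: int, runtime_s: float, hourly_price_usd: float) -> float:
    """Equation (1): instances * runtime in seconds * hourly price / 3600."""
    return instances_for(executors) * runtime_s * hourly_price_usd / 3600


# 4 executors, a 20-minute query, $0.102 per instance-hour (values from the example above).
cost = monetary_cost(executors=4, runtime_s=20 * 60, hourly_price_usd=0.102)
print(f"${cost:.3f}")  # prints $0.136
```

A different deployment policy (for example, several executors per instance) would only change `instances_for`.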
3.4 Query plan enumeration 


For a given query, the cost model is able to compare all valid query 
plans. In our case, two alternative join operators are available and 
we consider all their combinations, as well as all possible join 
orderings. We base our monetary cost estimations on the price 
of renting an a1.medium instance from Amazon EC2, considering 
cases of renting from one to eight a1.medium instances for query 
execution. The cost model is easily extensible; prices for 
more computing instance types can therefore be included, to compare 
different cluster and application scenarios. As a result, our search space 
for a query involving X join operations comprises 2^X join operator 
combinations, X! join orderings, and 8·N available application 
configurations, for N computing instance types. For example, 
TPC-H Query 3, which involves 3 join operations, yields 
2^3 · 3! · 8 = 384 alternative query plans when only one computing 
instance type is considered. The search space can become much larger 
for more complex queries; however, our cost model is able to compare 
all plans for queries with up to 6 joins with no significant optimization 
overhead. 
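
To give a feel for how quickly this search space grows, the count can be sketched in Python. The sketch assumes the size is the product of two join operators per join, all join orderings, and eight instance counts per instance type, which reproduces the 384 plans quoted for TPC-H Query 3; the function name and its parameters are hypothetical.

```python
from math import factorial


def search_space_size(joins: int, instance_types: int, max_instances: int = 8) -> int:
    """Number of (plan, configuration) alternatives under the assumptions stated above."""
    operator_combinations = 2 ** joins          # two join operators per join
    join_orderings = factorial(joins)           # all permutations of the joins
    configurations = max_instances * instance_types
    return operator_combinations * join_orderings * configurations


print(search_space_size(joins=3, instance_types=1))  # prints 384
print(search_space_size(joins=6, instance_types=1))  # prints 368640
```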

For the case of even more complex queries, the search space can 
be reduced significantly if we do not perform join reordering but 
keep the join order that Catalyst produces, and if we introduce 
distributed query optimization heuristics for join selection [11]. 

In the experimental part, we also had to overcome the limitation 
of Catalyst returning a single query plan and not providing alterna- 
tives. By reconfiguring certain configuration parameters (disabling 
broadcast joins, changing broadcast joins thresholds), we were able 
to produce and compare more query plans. 


3.5 Multi-objective optimization 


We follow a three-step process for multi-objective optimization. 

Step 1. First, we apply single-objective optimization to find the 
optimal query plan in terms of execution time, for every available 
system configuration. A single query plan is returned for each con- 
figuration. As query monetary cost is dependent on execution time, 
the fastest plan is also the cheapest, for a fixed system configura- 
tion, which explains the reason behind this first step. The different 
tradeoffs are created by the alternative configurations, and not by 
alternative query plans inside a certain application setting. 

Step 2. The second step is the multi-objective optimization one, 
as all the query plans from the first step are compared in terms of 
both objectives. Dominated plans are discarded, and the remaining, 
Pareto-optimal plans are the output, forming a Pareto front. The 
selected query plan will determine both the physical query plan 
that will be executed, as well as the hardware configuration. 
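
The dominance filter of this second step can be sketched as follows. It is a minimal illustration, not the authors' implementation: the `Plan` type, the function names and all time and cost figures are hypothetical, and both objectives are assumed to be minimized.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    executors: int
    time_s: float    # estimated execution time (hypothetical values)
    cost_usd: float  # estimated monetary cost (hypothetical values)


def dominates(a: Plan, b: Plan) -> bool:
    """a dominates b if it is no worse in both objectives and strictly better in one."""
    no_worse = a.time_s <= b.time_s and a.cost_usd <= b.cost_usd
    strictly_better = a.time_s < b.time_s or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_front(plans: list[Plan]) -> list[Plan]:
    """Keep only the plans that no other plan dominates."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# One fastest plan per configuration, as produced by the first step.
candidates = [
    Plan(2, 400.0, 0.0057),
    Plan(4, 220.0, 0.0062),
    Plan(8, 130.0, 0.0074),
    Plan(6, 230.0, 0.0098),  # slower and more expensive than the 4-executor plan
]
front = pareto_front(candidates)
print([p.executors for p in front])  # prints [2, 4, 8]
```

The quadratic pairwise comparison is adequate here because only one plan per configuration reaches this step.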

Step 3. After the Pareto front is formed, the final step is the query 
plan selection. As the number of alternative query plans can be 
large, the process of presenting the alternatives to the user, and 
assisting him/her to make a decision is challenging. In order to 
reduce the number of alternatives presented to the user and take 
into account budget and needs, price and latency filters can be 
applied. The cost model can also be used in a user-interactive mode, 
where the user submits preference weights to the objectives and 
receives a single plan best meeting them. In that case the problem 
is scalarized to a single-objective one, using the equation: 


$$F(x) = \frac{1}{1 + w_1 \cdot f_1(x)} \cdot \frac{1}{1 + c \cdot w_2 \cdot f_2(x)} \qquad (2)$$ 
In order to normalize the values of time and money to the same 
order of magnitude, we use a constant c. We set c to 
25000, so that 25 seconds of query execution are equivalent to a 
monetary cost of 0.001 US $. The value for c was selected empirically 
based on our experiments, in which execution time varied between 
50-400 seconds depending on the query and configuration and 
monetary cost between 0.001 - 0.01 US $ per query. In the case of 
equal weights, a query plan near the middle of the Pareto curve is 
selected, meaning that it does not prioritize any of the time-money 
objectives over the other. The query plan that has the maximum 
value for F is selected for execution. Before the user submits the 
preferred weights, he/she can also be provided with a value showing 
the time-money relationship for the selected weights. 


4 EXPERIMENTAL EVALUATION 


Methodology. The single-objective cost model makes the following 
assumptions, which our work inherits too: 


• It covers the class of GPSJ queries 

• It assumes uniform distribution of data in table sets 

• It performs single query optimization, assuming a cold start 

• It assumes operating on a homogeneous cluster 


We evaluated the cost model for a main-memory scenario, assum- 
ing that all data, as well as intermediate results, fit in memory. We 
also assumed that no exogenous factors affected 
cluster performance. We measured the Spark-Yarn initialization 
overhead of each query and excluded it from the evaluation. 

For the evaluation of the cost model, we follow a two-step method- 
ology. First, we examine the estimation accuracy of the cost model. 
Second, we evaluate the optimality of the cost model, aiming to see 
if it can point to an optimal query plan among alternatives. 

The cost model can be used to produce a Pareto front, including 
plans that offer different time-money tradeoffs. The decision maker 
can choose the query plan that best suits his/her needs or can use 
the cost model in a user-interactive mode, and assign weights to 
the objectives, in order to be provided with a single query plan. 

Setup. We conducted our experiments in Grid'5000 [1], a large- 
scale and flexible platform for experiment-driven research in com- 
puter science, with a focus on parallel and distributed computing. 

In our experiments, Grid'5000 was used as a private cloud, where 
we deployed up to 8 homogeneous computing nodes, each one 
containing 2 CPUs Intel Xeon E5-2630 v3, 8 cores/CPU, 128GB 
RAM, 5x558GB HDD, 186GB SSD, and 2 x 10Gb Ethernet. Inside 
our cluster, we set up an HDFS filesystem where each node worked 
as a datanode, and our dataset was stored in the SSDs. 

Experimental Evaluation. The single-objective cost model was 
reimplemented to model each one of the five RDD transformations. 
For our experiments, we used TPC-H benchmark queries and its 
dataset, scaled to 100 GB. The performance of the cost model was 
validated for many different factors. This allowed us to fine-tune 
the cost model for our system, and also re-evaluate it and observe 
its strengths and inaccuracies. Figure 2 shows its performance for 
a number of TPC-H queries (7.7% error) in a scenario with 4 Spark 
executors and 4 executor cores. The estimations are quite accurate 
with the exception of Query 10, where the execution time of a 
number of broadcast joins is overestimated. Figure 3 shows that the 
cost model is also able to capture the impact that adding Spark executors 
has on query execution, with an error rate of 12.1%. The query 
execution time values are an average for a set of TPC-H queries. A 
more detailed evaluation of the cost model estimation accuracy can 
be found in the thesis of Georgoulakis Misegiannis (2021), upon 
which this paper is based [2]. 

Inaccuracies mainly have to do with the fact that the cost model 
does not precisely model Spark data actions and transformations, 
but only provides estimates on key operations. Furthermore, in 
some cases it assumes linear relations between different factors 
not considering heuristics or overkills that occur in Spark. Finally, 
stochastic processes happening in the cluster might be influenc- 
ing system characteristics like the read/write throughput or the 
network speed, causing minor inaccuracies. As a result, tuning the 
cost model for a given cluster was a challenging task. 

In terms of optimality (prediction accuracy), the cost model 
makes correct predictions for trivial cases. For complex queries, 
when it did not point to the optimal plan, it was able to at least spot 
the trends and propose a near-optimal query plan, with a small 
impact on query execution time. Figure 4 shows the performance 
of the cost model on 3 alternative query plans for TPC-H Query 
2 and a configuration involving 4 Spark executors and 4 executor 
cores. As the table sets considered in Query 2 are quite small, the 
query plan involving only broadcast joins performs better than 
the alternatives. Using shuffle joins for every join results in a slightly 
worse execution time, whereas performing the first join with a shuffle 
join operator and the other two with a broadcast one performs 
similarly to the optimal case. The cost model gives equally 
good estimates for the two best execution plans. 

The cost model is able to make better choices than the Spark SQL 
optimizer in many cases. For example, for a simple query involving 
just a join operation (Fig. 5), we can see that a broadcast hash join 
is a better choice, resulting in 8 seconds faster query execution 
than using a shuffle hash join. However, Catalyst by default 
chooses to shuffle join these tables. The cost model is able to point 
to the broadcast hash join as the better option. In the experimental 
study, the cost model proved that it can be a "relevant first step 
for turning Catalyst into a fully cost based optimizer", showing 
significant estimation accuracy while staying on point when it 
comes to optimality and prediction quality. 

When it comes to performing multi-objective optimization, Fig- 
ure 6 shows the outcome of the extended cost model for a scenario 
with 3 alternative Spark application configurations with a varying 
number of executors (2,4 and 8), and 4 executor cores. 

The cost model returns the fastest query plan for each application 
configuration, and then the plans are compared in terms of both 
objectives. In that case, all three query plans are Pareto optimal and 
form a Pareto front, as they present different time-money tradeoffs. 
Thus, the problem is formulated as a multi-objective optimization 
one. The user can either decide himself/herself which query plan 
best fits his/her application, or can assign weights of preferences to 
the objectives in the user-interactive mode of the cost model. For 
the case of w1 = w2, the plan with 4 executors is picked. In the case 
of w1 = 2 * w2, the plan with 8 executors is picked, and in the case 
of w2 = 2 * w1, the selected plan is the one with 2 executors, as the 
time-money relationship changes each time. 
5 CONCLUSION - FUTURE WORK 


In this paper we proposed a multi-objective cost model for query 
optimization in Spark SQL. We built on a promising proposed cost 
model, which we extended with a formula for estimating the mone- 
tary cost of query plans in Spark. The cost model is able to compare 
query plans providing different time-money tradeoffs, and we also 
introduce a method for assisting the user in picking a single 
one. The cost model was implemented and tested in a detailed 
experimental study conducted in a private cloud environment. 

In the future, we aim to extend the cost model to consider het- 
erogeneous resources. This will require modeling the execution 
costs on different hardware, like GPUs. We also aim to explore more 
optimization goals, like energy consumption, which is a critical 
objective in green and sustainable data centers. Finally, a long term 
goal is to try to integrate the cost model into Catalyst. 
ABSTRACT 


The management of Big Data requires flexible systems to handle 
the heterogeneity of data models as well as the complexity of ana- 
lytical workflows. Traditional systems like data warehouses have 
reached their limits due to their rigid schema-on-write paradigm, 
which requires well-identified and well-defined use cases to ingest data. 
Data lakes, with their schema-on-read paradigm, have been pro- 
posed as more flexible systems in which raw data are directly stored 
in their original format associated with metadata, to be accessed 
and transformed only when users need to process or analyze them. 
Thus, it is necessary to define and control the different levels of ab- 
straction and the dependencies among functionalities of a data lake 
to use it efficiently. In this article, we present a formal framework 
aiming to define a data lake pattern and to unify the interactions 
among the functionalities. We use category theory as a theoreti- 
cal foundation to benefit from its high level of abstraction and its 
compositionality. By relying on different categories and functors, 
we ensure the navigation among the functionalities and allow the 
composition of multiple operations, while keeping track of the 
entire lineage of data. We also show how our framework can be 
applied to a simple example of a data lake. 


CCS CONCEPTS 


• Information systems → Decision support systems; Business intel- 
ligence; Data management systems; Information storage sys- 
tems; • Software and its engineering → Architecture description 
languages; • Computer systems organization → Architectures; 


KEYWORDS 


Data Lakes, Category Theory, Architecture Pattern 
1 INTRODUCTION 


Changes in the way of producing and consuming data have led to 
the emergence of new issues for information systems. Big Data are 
especially tied to 5 main concerns, usually referred to as the 5 Vs: Volume, 
Variety, Velocity, Veracity and Value. Management of such data 
requires appropriate environments offering enough flexibility to 
support heterogeneous data with different models, but also multiple 
data analysis tools and complex pipelines or workflows. Traditional 
systems like data warehouses have thereby reached their limits due 
to their schema-on-write paradigm. Indeed, when using Extract- 
Transform-Load (ETL) processes to ingest data, the time-consuming 
and difficult data integration and cleaning steps require well-known 
use cases, which hinders the flexibility of such systems. 

Data lakes have been proposed as flexible environments for stor- 
ing and analyzing Big Data [13] with a schema-on-read paradigm. 
In other words, raw data are usually directly stored in their original 
format and only processed when needed, but at the cost of complex 
model transformations that users must carry out. Different kinds 
of metadata are proposed to help users navigate, select, validate 
and transform data according to their needs. To a certain extent, 
this prevents data lakes from turning into data swamps [20], whose 
usability is reduced because of the difficulty of locating data. 

Data lakes have been described multiple times in the literature 
[9, 41, 44], but we follow Hai et al.'s definition [25]: 


A data lake is a flexible, scalable data storage and man- 
agement system, which ingests and stores raw data 
from heterogeneous sources in their original format, 
and provides query processing and data analytics in 
an on-the-fly manner. 


This definition is fairly complete. It presents the data lake as an 
integrated system from the user’s perspective (for example a data 
engineer), exposing multiple functionalities. At a technical level, a 
data lake is in fact an architecture that brings together specialized 
components. 

Despite being defined as systems for both storing and processing 
data with management and analytical processes, few architectures 
of data lakes take all these aspects into consideration in a unified 
manner [44]. This has led to the creation of alternative systems like 
Delta Lakes [2], Lakehouses [3] and Data Mesh [10], that focus on 
improving a subset of functionalities of data lakes. As a result, these 
systems have major drawbacks: they do not necessarily meet the 
functional requirements and they lack robustness. One of the main 
causes is that they are built as an assembly of isolated components 
having different behavioral properties, and thus it is difficult to 
ensure the validity of the expected properties for the whole system. 

However, the use of multiple components to support all the 
functionalities of the data lake is inevitable. Thus, the notion of data 
lake is in fact an architecture pattern in which the functionalities 
are well-defined. To avoid the data lake construction issues, some 
works narrow their system for a specific use case according to 
different domains [31, 34, 36, 40, 43]. We adopt a more abstract 
point of view, and aim to define a framework that generalizes 
the data lake pattern and unifies the component interactions. 

The greater control offered by robust systems is usually obtained 
through solid theoretical foundations [5, 29], such as Codd’s rela- 
tional model [8] and algebra for database management systems, 
which can provide formal frameworks for studying different prop- 
erties and their preservation or for building optimization processes. 
Data lakes, as pragmatic solutions originally designed to resolve 
industrial issues, were not defined at the time with strong theoret- 
ical foundations to describe and validate their functionalities, or 
to control their uses. Nevertheless, they could benefit from such a 
framework. 

Category theory [15] is a meta-mathematical formalism intro- 
duced in the 1940s by Saunders MacLane and Samuel Eilenberg. It 
has already been successfully used for building formal frameworks 
in various domains of computer science like functional program- 
ming or software architectures. These domains especially benefit 
from the high level of abstraction and compositionality of this the- 
ory. Data lakes can greatly take advantage of these characteristics, 
as data and functions need to be represented together. Indeed, the 
schema-on-read paradigm requires enough expressivity to allow 
all kinds of processing, but it also necessitates constraints in order 
to control data and metadata organization as well as their trans- 
formations and analysis. Category theory can meet all of these 
requirements. 

In this article, we propose to use category theory to build a for- 
mal framework that interconnects the different 
functionalities of a data lake and unifies the levels of abstraction. 
It allows the functionalities to be composed, and thus the lineage 
of data to be tracked, giving a formal structure to data 
lakes while coping with their need for flexibility. We also show the 
usability of the framework by applying it to a simple example of a 
data lake. 

The remainder of this paper is structured as follows. First, we 
describe in section 2 the main functionalities of data lakes according 
to the literature, we present other formalisms previously proposed 
for data lakes and we show how category theory can contribute 
to the formalization of these systems. In section 3, we introduce the 
main concepts of category theory and our formal framework. 
We then use it to model a small example of data lake in section 3.3. 
Finally, we draw conclusions of our work and open up perspectives 
for the future in section 4. 
2 STATE OF THE ART AND DISCUSSION 


In this section, after describing the main functionalities of data 
lakes, we argue that various levels of abstraction and dependencies 
among functionalities justify the need of a formalism allowing both 
abstraction and composition. In a second part, we study the major 
works regarding the formalization of data lake components and we 
show how category theory can be used in this context. 


2.1 Main Functionalities of Data Lakes 


Reviews of the literature [9, 25, 41, 44] agree on 4 main function- 
alities for data lakes, namely Data Storage, Data Ingestion, Data 
Maintenance and Data Exploration. 

Data Storage can take several forms in data lakes, from single 
systems handling data heterogeneity with generic models [16, 39] to 
polystores built as an assembly of specialized database management 
systems [4, 24, 27]. As datasets must be properly and accurately 
annotated with metadata so that the data lake does not become a 
data swamp [20], the data storage functionality also includes the 
storage of metadata. 

Data Ingestion provides tools for connecting the system to data 
sources, loading the data in streaming and/or batch manner and 
retrieving or producing basic metadata [19, 38]. For some data, only 
aggregated insights may need to be stored in order to reduce their large 
volume, as is the case for high-velocity streams of sensor 
data. 

Data Maintenance ensures: 1) the usability of the data by orga- 
nizing the lake [1, 33] and by extracting more advanced metadata 
[4, 17, 27], for example through profiling or through the discovery 
of relationships among datasets; 2) the quality of data [16, 26], by 
guaranteeing or improving it, for example through the application
of integrity constraints; and 3) the ease of use of the system by 
providing functionalities that make schema-on-read simpler and 
more efficient such as schema pre-integration [24]. 

Finally, Data Exploration functionality allows the discovery 
of content in data lakes through unified query interfaces [4, 24] 
or navigational algorithms based on measures of relatedness [17] 
or faceted search [27]. During the exploration phase, datasets are 
retrieved and integrated in an on-the-fly manner. Queries and algo- 
rithms can be applied on them in order to obtain results according 
to the case study. 

The description of the functionalities extracted from the liter- 
ature reveals two major characteristics of data lakes. Firstly, it 
brings to light different levels of abstraction related to: 1) data, 
including various models, formats and metadata; 2) software archi- 
tectures, with different strategies and components that can be used 
to store and process data; and finally 3) functionalities themselves, 
composed to implement other higher level services. Secondly, the 
definitions also reveal existing dependencies among the main 
functionalities. Data exploration depends on data maintenance 
and on data storage, data storage depends on data ingestion and 
finally data maintenance depends on data storage. 
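
The dependencies listed above form a small directed graph. As a minimal sketch (the dictionary name `DEPENDS_ON` and the use of Python's standard `graphlib` module are illustrative choices, not part of the reviewed literature), a topological sort recovers an order in which the functionalities can be put in place:

```python
from graphlib import TopologicalSorter

# Each functionality maps to the functionalities it depends on,
# as stated in the paragraph above.
DEPENDS_ON = {
    "exploration": {"maintenance", "storage"},
    "storage": {"ingestion"},
    "maintenance": {"storage"},
    "ingestion": set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

Because the dependencies form a chain, the resulting order is unique: ingestion comes first and exploration last.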

The higher complexity induced by abstractions and dependencies 
is a strong motivation for building a formal framework able to 
ensure the robustness of data lakes and to control the interactions 
among components.


2.2 On Formalization of Data Lakes


Only a few proposals have been made to provide such a formal framework
for data lakes. Most previous works have focused on the
formalization of isolated aspects of data lakes such as metadata
models, data storage, analytical queries, etc.

Several models have been proposed for metadata modeling using 
UML, entity relationship or graph theory. In [45] the authors state 
that existing metadata models are either tailored for a specific use 
case or not generic enough to be used in different contexts. They 
extend their previous model called MEDAL (MEtadata model for 
DAta Lakes) to build a more generic one called goldMEDAL, defined 
on three levels: conceptual, logical and physical. The conceptual 
level is formalized through set theory and describes data entities 
and groupings, as well as hierarchy and lineage relationships. The 
logical level puts together the previous elements through graph the- 
ory concepts, especially nodes, edges and hyperedges. The physical 
level is finally implemented with the metadata framework Apache 
Atlas. The overall proposal of goldMEDAL is synthesized in a UML 
class diagram. 

In [42], a classification of metadata in two groups is proposed, 
with a special attention given to metadata related to data gover- 
nance concerns such as data access, quality and security. Metadata 
describing various relationships among datasets are classified as 
inter-metadata and metadata describing datasets themselves are 
classified as intra-metadata. The conceptual metadata model is rep- 
resented with a UML class diagram. A data lake architecture based 
on three zones is also represented but not formalized. This proposal 
has been later extended by the authors in [55] with a new analysis- 
oriented metadata model, also conceptualized through a UML class 
diagram. 

Finally for metadata models, ensemble modeling and more pre- 
cisely data vaults are used in [37] to create a model allowing better 
evolutivity for data and schema. At the conceptual level, datasets 
are classified inside satellites, logically abstracted inside hubs and 
finally associated inside links. The proposal is represented with a 
graph. 

The storage layer of semantic data lakes has been formalized 
with set theory in [11] as a tuple containing a set of data sources 
(datasets), a set of metadata catalogs describing the datasets with 
directed graphs, a global knowledge graph and a mapping function 
relating metadata to knowledge concepts in the global graph. In the 
same article, the authors also propose a set-theoretical formaliza- 
tion of analytical queries. They are defined as sets of indicators of 
interest measured along sets of dimensions of analysis. A response 
to such query is a set of metadata and transformation rules allowing 
the discovery of relevant data. 

In [25], the authors compare four formal schema mapping lan- 
guages based on tuple-generating dependencies (tgds), namely sim- 
ple tgds, nested tgds, second-order tgds (SO tgds) and plain SO tgds, 
as potential formal frameworks for integration tasks in data lakes. 
The different tgds are compared depending on their expressiveness 
as well as on the set of structural or reasoning properties they can 
ensure among the existence of universal solutions, closure under 
target homomorphism and allowing conjunctive query rewriting. 
SO tgds languages are identified as more expressive than the other
two but also less reliable on the properties due to their higher time
complexity for model checking. 

Despite existing attempts to formalize parts of data lakes, a for- 
mal framework allowing a unified and complete representation of 
all the functionalities and levels of abstraction of such systems is 
still missing. Moreover, existing pieces of formal definitions are 
mainly based on semi-formal and descriptive models like UML or 
labeled graphs, which are not restrictive enough to guarantee math- 
ematical, structural and/or reasoning properties to the proposed 
model and to ensure their preservation in any subsequent concrete 
implementation [6]. 


2.3 Contributions of Category Theory to Data 
Lakes 


Category theory is an abstract meta-mathematical theory. It helps 
to reconcile the expressiveness of descriptive models and the restric- 
tiveness of mathematical ones. This formalism has already been 
successfully used to address some challenges of computer science, 
for example to build a general framework for the specification of 
concurrent systems [14] or to allow the compositionality of machine
learning components [46]. To the best of our knowledge, it
has never been studied as a foundation for a complete formal framework
for data lakes. Nevertheless, some works use category theory to tackle
issues relevant to these systems.

Related to the data storage functionality, schemas and data in- 
stances in relational databases have been modeled with small cate- 
gories and set-valued functors in [47], and constraints with functors 
and natural transformations later in [49]. Frameworks for object- 
oriented databases have been proposed in [30] and for document- 
oriented databases in [51]. The management of multi-model data 
and data integration issues are studied in [28, 32, 52] with some basic 
categorical tools like categories and functors as well as with more 
advanced ones such as pullbacks, pushforwards and Kan lifts. Finally,
metadata models based on category theory have been proposed 
in [7, 12]. 

On data maintenance, category theory has been mostly used to 
ensure data quality, through a metamodel adapted for geographic 
information systems in [18] and through a framework for variability 
models of software engineering in [35]. 

Finally, on data exploration, most of the existing works using 
category theory focus on query processing. In [22, 23], monads 
have been proposed as representation for queries, and monad com- 
prehension is used for query processing. A query language imple- 
mented with the functional programming language Haskell, based 
on functors and using natural transformations for optimization is 
presented in [50]. Category theory also serves as the basis for a framework
of a search meta-engine in [53]. Other issues related to the
data exploration functionality of data lakes include data and schema
integration, which has been tackled in [48] with functorial data
migration operations, and the creation of data visualization, which
has been formalized in [54].


2.4 Synthesis


Data lakes have been detailed several times in terms of functionali- 
ties, but a formal definition expressed through a theoretical frame- 
work is still missing, and existing proposals either lack expressive- 
ness or restrictiveness. Category theory is a promising candidate 
as a foundation for such a framework and has already been used in
several relevant contexts for data lakes. A categorical definition 
unifying all the main functionalities of data lakes should provide 
the formal framework needed to improve them with mathematical 
properties and theorems, while creating a bridge with the previous 
works using the same formalism. 


3 FORMALIZATION 


In this section, after a brief introduction to category theory, we give 
a high level theoretical description of the main functionalities of a 
data lake presented in section 2.1. We show how composition and 
abstraction can be used to define different levels of representation 
and how different types of functors can ensure the navigation 
among them. We explain how our framework can be used to check 
the validity of an implementation of a data lake. We finally illustrate 
this on an example in subsection 3.3. 


3.1 Category theory in a nutshell 


Category Theory describes structures as categories and relations 
between them with functors. 

A category $C$ is defined by a collection $Ob(C)$ of objects, a
collection $Hom(C)$ of directed relations between these objects, called
morphisms, and a binary associative operation (noted $\circ$) to
compose morphisms. The sub-collection of morphisms between an
object $x$ (called domain) and an object $y$ (called codomain), both
in $Ob(C)$, can be expressed as $Hom_C(x, y)$, and a morphism $f$
between these objects is noted $f : x \to y$. Each object $x \in Ob(C)$
is associated with an identity morphism $id_x : x \to x$, acting as
neutral element with $\circ$.

A category $C$ is said to be locally small if every $Hom_C(x, y)$ is a set,
small if $Ob(C)$ and $Hom(C)$ are both sets and large otherwise.
There is also a large category Cat whose objects are all the small
categories and whose morphisms are the functors between them.

A functor $F : C \to D$ is a structure mapping the objects and
morphisms of a category $C$ to objects and morphisms of a category
$D$. Functors preserve identities, that is $\forall x \in Ob(C), F(id_x) =
id_{F(x)}$, and preserve composition, that is $\forall f : x \to y, g : y \to
z, F(g \circ f) = F(g) \circ F(f)$.
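
The two functor laws can be checked mechanically on a toy example. The sketch below is only an illustration under stated assumptions: morphisms are modeled as paths of named generators (that is, the categories are treated as freely generated), and the objects `x`, `y`, `z`, `a`, `b` and the generators `f`, `g`, `h` are hypothetical names, not taken from the framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Morphism:
    dom: str
    cod: str
    path: tuple = ()  # generator names in application order; () is an identity

def identity(x):
    return Morphism(x, x)

def compose(g, f):
    """Return g ∘ f: apply f first, then g."""
    if f.cod != g.dom:
        raise ValueError(f"cannot compose: {f.cod} != {g.dom}")
    return Morphism(f.dom, g.cod, f.path + g.path)

def apply_functor(on_objects, on_generators, m):
    """Extend a map on objects and generators to any morphism of the source."""
    image = identity(on_objects[m.dom])
    for name in m.path:
        image = compose(on_generators[name], image)
    return image

# Hypothetical source category C: x --f--> y --g--> z
f = Morphism("x", "y", ("f",))
g = Morphism("y", "z", ("g",))

# Hypothetical target category D: a --h--> b; F collapses y and z onto b.
h = Morphism("a", "b", ("h",))
F_obj = {"x": "a", "y": "b", "z": "b"}
F_gen = {"f": h, "g": identity("b")}

def F(m):
    return apply_functor(F_obj, F_gen, m)

identity_preserved = all(F(identity(o)) == identity(F_obj[o]) for o in F_obj)
composition_preserved = F(compose(g, f)) == compose(F(g), F(f))
print(identity_preserved, composition_preserved)
```

Representing morphisms as paths makes composition a concatenation, so both laws hold by construction as soon as each generator is sent to a morphism with the expected domain and codomain.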

A constant functor $\Delta_{C \to D} : C \to D$ is a special mapping that
collapses every object in $Ob(C)$ to a single object $d \in Ob(D)$ and
every morphism in $Hom(C)$ to the identity morphism $id_d$. Surjective
functors act on every non-empty $Hom_D(x, y)$. A functor
$F : C \to D$ is said to be surjective if for every $x, y \in Ob(D)$ and
every morphism in $Hom_D(x, y)$, there is at least one morphism
in $Hom_C(F^{-1}(x), F^{-1}(y))$ (surjective mapping on every morphism
of $D$).

A product of two categories $C_1$ and $C_2$ produces a new category
whose objects are all the possible pairs $(x, y)$ with $x \in Ob(C_1)$ and
$y \in Ob(C_2)$ and whose morphisms $(x, y) \to (x', y')$ are pairs $(f, g)$ where
$f : x \to x' \in Hom_{C_1}(x, x')$ and $g : y \to y' \in Hom_{C_2}(y, y')$. A
bifunctor has a product of categories as domain, and a category
as codomain.
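
To make the product construction concrete, the following sketch enumerates the objects and morphisms of a product of two small categories. It is a minimal illustration: the two-object categories and the helper names `product_category` and `with_identities` are hypothetical, and each morphism is encoded as a name mapped to its (domain, codomain) pair.

```python
from itertools import product

def with_identities(objs, mors):
    """Add the identity morphism of every object to a morphism table."""
    return {**mors, **{f"id_{o}": (o, o) for o in objs}}

def product_category(objs1, mors1, objs2, mors2):
    """Objects and morphisms of C1 × C2; morphisms are name -> (dom, cod)."""
    objects = list(product(objs1, objs2))
    morphisms = {
        (f, g): ((dom_f, dom_g), (cod_f, cod_g))
        for (f, (dom_f, cod_f)), (g, (dom_g, cod_g))
        in product(mors1.items(), mors2.items())
    }
    return objects, morphisms

# Hypothetical categories: C1 has f : x -> x', C2 has g : y -> y'.
C1_objs = ["x", "x'"]
C1_mors = with_identities(C1_objs, {"f": ("x", "x'")})
C2_objs = ["y", "y'"]
C2_mors = with_identities(C2_objs, {"g": ("y", "y'")})

objects, morphisms = product_category(C1_objs, C1_mors, C2_objs, C2_mors)
print(len(objects), len(morphisms))
print(morphisms[("f", "g")])
```

A bifunctor is then an ordinary functor whose domain is such a product, so it must assign an image to every pair of objects and every pair of morphisms.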


3.2 Categorical Framework for Data Lakes 


In this subsection, we propose a high level categorical framework 
for data lakes, based on the description of the main functionalities 
established in the literature and presented in section 2. 

At the highest level of abstraction, a data lake can be seen as a 
large category DL. This category is more precisely defined in ta- 
ble 1 and is visually represented in figure 1. Its collection of objects 
Ob(DL) includes three of the main functionalities identified in sec- 
tion 2, namely Data Storage, Data Ingestion and Data Exploration. 
These objects are themselves categories. The Data Maintenance 
functionality has a specific representation. As it mainly transforms 
data to improve their usability, it is represented by a bifunctor 
Storage × Storage → Storage. This makes it possible to define maintenance
operations that take a single dataset along with its metadata as
input, but also operations that take two datasets with their metadata
as input (such as the discovery of relationships between entities).


Figure 1: Representation of the category DL


Figure 2 and Tables 1 and 2 summarize the following
statements. The reader can use them as a reading guide.

The Data Ingestion functionality represents the entry point for 
data into the data lake. To do so, its Ingestion category contains 
three objects: raw_data for data as they are from their source, 
dataset for data as they enter into the data lake system and metadata 
that are extracted from dataset. The morphisms in this category 
show the different steps of data ingestion: the raw_data are loaded
into the system to create a dataset, this dataset can be transformed
(for example to compute aggregated data or to add some lightweight
information such as the timestamp of the entry in the data lake),
and some metadata can be extracted from this dataset.

The dataset and its metadata must be stored into the data lake. 
This functionality is supported by the Storage category. It is a 
category, in which objects represent data at different levels of ab- 
straction, linked by morphisms. The different levels considered are 
the physical one with md_system and d_system, the logical one 
with dataset, the conceptual one with metadata and the structural 
one with md_model and d_model. The morphisms link a dataset
to its metadata, and the dataset and metadata to their respective
model. The different models are themselves linked to their storage
system.

Name        | Objects              | Morphisms (without identity morphisms)
------------+----------------------+-------------------------------------------
DL          | storage, ingestion,  | store : Ingestion → Storage
            | exploration          | explore : Storage → Exploration
            |                      | maintenance : Storage × Storage → Storage
Ingestion   | raw_data, dataset,   | load : raw_data → dataset
            | metadata             | transform : dataset → dataset
            |                      | extract : dataset → metadata
Storage     | metadata, dataset,   | described_by : dataset → metadata
            | d_model, d_system,   | md_modeled_by : metadata → md_model
            | md_model, md_system  | md_stored_in : md_model → md_system
            |                      | d_modeled_by : dataset → d_model
            |                      | d_stored_in : d_model → d_system
Exploration | catalogue, dataset,  | localize : catalogue → dataset
            | metadata             | query : dataset → dataset
            |                      | algorithm : dataset → dataset
            |                      | described_by : dataset → metadata

Table 1: High level categories of the framework

Name    | Type                  | Elements in Domain | Elements in Codomain
--------+-----------------------+--------------------+---------------------
store   | Ingestion → Storage   | raw_data           | dataset
        |                       | dataset            | dataset
        |                       | metadata           | metadata
        |                       | load               | id_dataset
        |                       | transform          | id_dataset
        |                       | extract            | described_by
explore | Storage → Exploration | dataset            | dataset
        |                       | d_model            | dataset
        |                       | d_system           | dataset
        |                       | metadata           | metadata
        |                       | md_model           | metadata
        |                       | md_system          | metadata
        |                       | described_by       | described_by
        |                       | d_modeled_by       | id_dataset
        |                       | d_stored_in        | id_dataset
        |                       | md_modeled_by      | id_metadata
        |                       | md_stored_in       | id_metadata

Table 2: Functors used in the framework
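
The content of Tables 1 and 2 can be encoded directly and checked for consistency. The sketch below is an illustration, not the authors' tooling: the dictionary layout and the helper names `signature` and `ill_typed` are assumptions, and the object map of `explore` is assumed to send `md_system` to `metadata`. It verifies that each functor sends every generating morphism to a morphism whose domain and codomain are the images of the original ones.

```python
# Generating morphisms of each category (Table 1): name -> (domain, codomain)
INGESTION = {
    "load": ("raw_data", "dataset"),
    "transform": ("dataset", "dataset"),
    "extract": ("dataset", "metadata"),
}
STORAGE = {
    "described_by": ("dataset", "metadata"),
    "md_modeled_by": ("metadata", "md_model"),
    "md_stored_in": ("md_model", "md_system"),
    "d_modeled_by": ("dataset", "d_model"),
    "d_stored_in": ("d_model", "d_system"),
}
EXPLORATION = {
    "localize": ("catalogue", "dataset"),
    "query": ("dataset", "dataset"),
    "algorithm": ("dataset", "dataset"),
    "described_by": ("dataset", "metadata"),
}

# Functors (Table 2): images of objects and of generating morphisms
STORE = {
    "objects": {"raw_data": "dataset", "dataset": "dataset", "metadata": "metadata"},
    "morphisms": {"load": "id_dataset", "transform": "id_dataset",
                  "extract": "described_by"},
}
EXPLORE = {
    "objects": {"dataset": "dataset", "d_model": "dataset", "d_system": "dataset",
                "metadata": "metadata", "md_model": "metadata",
                "md_system": "metadata"},  # assumed image of md_system
    "morphisms": {"described_by": "described_by", "d_modeled_by": "id_dataset",
                  "d_stored_in": "id_dataset", "md_modeled_by": "id_metadata",
                  "md_stored_in": "id_metadata"},
}

def signature(name, category):
    """Domain and codomain of a generator or of an identity named id_<object>."""
    if name.startswith("id_"):
        return (name[3:], name[3:])
    return category[name]

def ill_typed(functor, source, target):
    """Source morphisms whose image has the wrong domain or codomain."""
    errors = []
    for name, (dom, cod) in source.items():
        expected = (functor["objects"][dom], functor["objects"][cod])
        if signature(functor["morphisms"][name], target) != expected:
            errors.append(name)
    return errors

print(ill_typed(STORE, INGESTION, STORAGE))
print(ill_typed(EXPLORE, STORAGE, EXPLORATION))
```

Both calls return an empty list, meaning that every image is compatible with the object maps. This is a necessary condition for `store` and `explore` to be functors; it does not by itself examine equations between composite morphisms.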


Once the data are stored, they can be accessed in order to be explored.
This is done with the Exploration category. The catalogue
allows the localization of a dataset, through its metadata. It is pos- 
sible to run some queries or algorithms on the retrieved dataset in 
order to get the desired result. 


With this representation, constant functors can link different lev- 
els when lower level categories are embedded in objects of higher 
level categories. For example, the refinements carried by the Stor- 
age category are all embedded in one object of DL, namely Storage. 
A constant functor $\Delta_{Storage \to DL} : Storage \to DL$ can be used to
link the lower level category to the higher level one. Functorial laws 
are satisfied because every object and every identity morphism in 
the domain category is mapped to the same object and its identity 
morphism in the codomain category (identity preservation) and be- 
cause identity morphisms can always be composed with themselves 
(composition preservation). 
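
As a minimal sketch of this constant functor (the function name `constant_functor` and the `id_<object>` naming convention are assumptions made for the illustration), every object of Storage is sent to the single object Storage of DL and every morphism to its identity:

```python
# Objects and generating morphisms of Storage, copied from Table 1.
STORAGE_OBJECTS = ["dataset", "metadata", "d_model", "d_system",
                   "md_model", "md_system"]
STORAGE_MORPHISMS = {
    "described_by": ("dataset", "metadata"),
    "md_modeled_by": ("metadata", "md_model"),
    "md_stored_in": ("md_model", "md_system"),
    "d_modeled_by": ("dataset", "d_model"),
    "d_stored_in": ("d_model", "d_system"),
}

def constant_functor(objects, morphisms, target_object):
    """Send every object to target_object and every morphism to its identity."""
    target_identity = f"id_{target_object}"
    on_objects = {obj: target_object for obj in objects}
    on_morphisms = {name: target_identity for name in morphisms}
    # Identity morphisms of the source are collapsed as well.
    on_morphisms.update({f"id_{obj}": target_identity for obj in objects})
    return on_objects, on_morphisms

delta_obj, delta_mor = constant_functor(STORAGE_OBJECTS, STORAGE_MORPHISMS, "Storage")

# Every image is id_Storage, and id_Storage ∘ id_Storage = id_Storage,
# so any composite in Storage is also sent to id_Storage.
print(set(delta_obj.values()), set(delta_mor.values()))
```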

To use the categories of the framework with an instance of a data 
lake, each implementation of a functionality must be represented
in a category that must be linked to its corresponding higher-level
category with a surjective functor. The surjectivity condition
ensures the respect of the proposed framework and its structure, 
while allowing more complete descriptions of functionalities in the 
category of the implementation. Furthermore, the morphisms of the 
DL category add a constraint that forces the existence of functors 
between the categories concerned by each morphism when they 
are represented as objects in the DL category. 


To ensure the navigation between the functionalities of the 
data lake, functors exist between the Ingestion and the Storage 
categories, and between the Storage and the Exploration categories. 
The defined functors and morphisms force the direction of the 
different transformations, and avoid the scattering of data. As the 
categories that will be defined for the implementation must be 
linked to their corresponding higher-level category, this also forces
the implementation to provide these functors between the different
implemented functionalities.

Figure 2: Representation of the categories Ingestion, Storage and Exploration, and the functors that link them


By relying on their constraints, such as the preservation of identities
and of the composition of morphisms, functors make it possible to keep
track of data from their loading into the data lake to the production
of results from a dataset in the exploration phase. Indeed, all
the objects and the morphisms of the domain category must be
sent to the codomain category, so no information is lost when
switching from one functionality to another.


The maintenance functionality takes a different form in the
formalization framework. As it aims at improving the quality of
the datasets or the metadata in order to ease their use during the
exploration phase, two major applications of a maintenance operation
can be found: improving a dataset or its metadata directly
from themselves, or improving a dataset or its metadata by relying
on another dataset along with its metadata. Within category
theory, it is possible to use bifunctors to cover both of these
applications. To do so, the bifunctor maintenance is defined as
Storage × Storage → Storage. The first Storage category corresponds
to the one on which the maintenance operation is applied,
the second one corresponds to the one that brings the elements
required for the improvement (it can be the same Storage category
as the first one when it is used to improve itself). The last Storage
category is the result of the maintenance operation, and can be
seen as the evolution of the first Storage category.

Figure 3 gives an overview of this mechanism. The objects
(dataset, metadata) and (metadata, dataset) can be mapped either
to the object metadata or to the object dataset of the resulting
Storage category, depending on the effect of the maintenance operation.
We detail this mechanism further with an example in
the following section. The other associations of objects are omitted
for the sake of readability.


This representation allows the composition of multiple maintenance
operations, while giving freedom to apply them in any
order and to any data. Thus, the maintenance operations can be chosen
according to the individual needs of each dataset. Furthermore,
it also imposes constraints, as a maintenance operation can only be
applied on existing datasets and metadata. This contributes to
the identification of dependencies among datasets and metadata.

Overall, this framework keeps track of the entire
lineage of data. It is possible to know the source of data, how 
they have been transformed and the relationships that exist among 
datasets. It also imposes constraints on the validity of transforma- 
tions applied on data, and on the switching from a functionality to 
another. Furthermore, operations can be composed, mainly with 
the maintenance functor, in order to allow a high flexibility that is 
essential to data lakes. 


3.3 Example 


We propose to show the usability of our categorical framework by 
defining a small data lake. We rely on the following use case: an 
enterprise has data about their customers, and records their online 
activity on the enterprise web applications. 

The instances of the Ingestion and Storage categories are represented
in figure 4. Regarding the data of the online activity, the
enterprise is not interested in all the data, so it performs an aggregation
operation in order to store only a summary of the data
(category Ing_ds1). The data of the customers are easier to handle.
As they are extracted from the enterprise information system, they 
do not need any transformation before being stored into the data 
lake (category Ing_ds2). 

Once the data are ingested, they are stored in the data lake. For
the activity data, the dataset is modeled as time series and InfluxDB
is used as storage system, while the metadata are modeled as
graphs in Neo4j (category Str_ds1). Regarding the customer data,
the metadata are stored in the file system in JSON format, and the
dataset in the relational PostgreSQL database (category Str_ds2).

Figure 3: Partial representation of the maintenance functor, in which the immutable parts are represented

Figure 4: The ingestion and storage of the two datasets

Table 3 states the effects of functors on the objects and
morphisms from the instance categories of Figure 4 to the corresponding
high-level categories Ingestion and Storage. Thus, the
ingestion and storage functionalities of the implemented data lake
satisfy the requirements of the formalization, including the surjectivity
condition on functors.
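
The surjectivity condition can be sketched for the Str_ds1 instance as follows. The object names follow the description of the example; the morphism names of the instance category, the mapping to Storage and the helper `uncovered` are assumptions made for the illustration. The check reports the generating morphisms of Storage that have no counterpart in the implementation.

```python
# Generating morphisms of the high-level Storage category (Table 1).
STORAGE = {
    "described_by": ("dataset", "metadata"),
    "md_modeled_by": ("metadata", "md_model"),
    "md_stored_in": ("md_model", "md_system"),
    "d_modeled_by": ("dataset", "d_model"),
    "d_stored_in": ("d_model", "d_system"),
}

# Assumed encoding of the instance category Str_ds1.
STR_DS1 = {
    "described_by": ("dataset", "metadata"),
    "d_modeled_by": ("dataset", "time_series"),
    "d_stored_in": ("time_series", "InfluxDB"),
    "md_modeled_by": ("metadata", "graph"),
    "md_stored_in": ("graph", "Neo4j"),
}

# Assumed functor Str_ds1 -> Storage on objects and generating morphisms.
ON_OBJECTS = {
    "dataset": "dataset", "metadata": "metadata",
    "time_series": "d_model", "InfluxDB": "d_system",
    "graph": "md_model", "Neo4j": "md_system",
}
ON_MORPHISMS = {name: name for name in STR_DS1}

def uncovered(target, source, on_objects, on_morphisms):
    """Target generators with no preimage whose endpoints map correctly."""
    missing = []
    for name, (dom, cod) in target.items():
        hit = any(
            on_morphisms[src] == name
            and (on_objects[s_dom], on_objects[s_cod]) == (dom, cod)
            for src, (s_dom, s_cod) in source.items()
        )
        if not hit:
            missing.append(name)
    return missing

print(uncovered(STORAGE, STR_DS1, ON_OBJECTS, ON_MORPHISMS))

# A partial implementation that does not model or store its metadata.
partial = {k: v for k, v in STR_DS1.items() if not k.startswith("md_")}
print(uncovered(STORAGE, partial, ON_OBJECTS, ON_MORPHISMS))
```

The first call returns an empty list, so every morphism required by Storage is implemented. The second call lists the two metadata morphisms that the partial implementation fails to provide.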

Once the two datasets are stored, a maintenance operation can 
be applied on them (figure 5). In this operation, the dataset of 
the temporal activity is enriched with the data about customers 
(category Str_ds1_v2) in order to gain more information regarding 
their different characteristics, for example their country that can be 
used to make the typical hours of activity more precise. With this 
enriched dataset, an exploration is performed, first to restrict the
dataset to a given period, and then to execute an anomaly detection
algorithm to reveal fraudulent uses.
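
At the level of data instances, the maintenance operation followed by the exploration can be sketched as below. Everything in this block is hypothetical and only meant to illustrate the workflow: the records, the field names, and the choice of a simple z-score rule as anomaly detection algorithm are assumptions, not part of the example of the paper.

```python
from statistics import mean, pstdev

# Hypothetical summary of online activity (Str_ds1) and customer data (Str_ds2).
activity = [
    {"customer": "c1", "hour": 9, "requests": 10},
    {"customer": "c1", "hour": 10, "requests": 12},
    {"customer": "c2", "hour": 10, "requests": 11},
    {"customer": "c2", "hour": 11, "requests": 9},
    {"customer": "c1", "hour": 11, "requests": 10},
    {"customer": "c2", "hour": 12, "requests": 200},
    {"customer": "c1", "hour": 23, "requests": 8},
]
customers = {"c1": {"country": "FR"}, "c2": {"country": "HU"}}

def maintenance(dataset, helper):
    """Enrich each activity row with the country of its customer."""
    return [{**row, "country": helper[row["customer"]]["country"]}
            for row in dataset]

def query(dataset, first_hour, last_hour):
    """Restrict the dataset to a given period."""
    return [row for row in dataset if first_hour <= row["hour"] <= last_hour]

def algorithm(dataset, threshold=2.0):
    """Flag rows whose request count lies far above the mean (z-score rule)."""
    counts = [row["requests"] for row in dataset]
    mu, sigma = mean(counts), pstdev(counts)
    return [row for row in dataset
            if sigma > 0 and (row["requests"] - mu) / sigma > threshold]

enriched = maintenance(activity, customers)      # Storage × Storage → Storage
anomalies = algorithm(query(enriched, 9, 12))    # query, then algorithm
print(anomalies)
```

The function `maintenance` takes the two stored datasets and returns a new version of the first one, which mirrors the shape of the bifunctor, while `query` and `algorithm` correspond to the two endomorphisms of dataset in the Exploration category.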

This example demonstrates how category theory supports
the navigation across abstraction levels and how properties on
functors constrain the implemented functionalities to comply with
the structure defined in the corresponding higher-level categories.
Moreover, the framework makes it possible to represent all the potential
functionalities of a data lake, and thus to check the validity of
an implementation of a data lake.


4 CONCLUSION 


In this article, we have proposed a unified formal framework for 
data lakes based on category theory. The navigation between the
different functionalities is controlled by functors and compositions,
which make it possible to keep track of the lineage of data while providing the
flexibility required by these systems. The levels of abstraction of the
data lake are linked with constant or surjective functors, which ensure
the validity of implementations of data lakes. We have shown on
an example how the framework can be used. 

Unlike previous works on the formalization of data lakes, our
proposal considers a unified and complete view of all the main
functionalities identified by the literature, namely Data Ingestion,
Data Storage, Data Maintenance and Data Exploration. Thanks to
the expressiveness and restrictiveness of category theory, we have
also been able to represent and control the various dependencies
and levels of abstraction existing in data lakes. Category theory
additionally creates a bridge with existing works of the literature
using the same formalism, allowing their use as refinements of
some higher level aspects described in our framework.

Figure 5: A maintenance operation followed by an exploration

In future work, we plan to extend our formalism
to allow the definition and control of complex and hybrid
workflows for accessing and querying data in data lakes. Such
workflows are indeed very important for these systems, in which a
variety of operations can be executed in the same environment, for
example operators of relational algebra, machine learning tasks
based on linear algebra, user-defined functions, etc. These various
operations could therefore be unified by expressing them through
categories, functors and bifunctors and then linked to the rest of
the framework. We also plan to introduce the physical level of components
in the data lake architecture by mapping the implemented
functionalities to their corresponding component through functors.
With this configuration, we can rely on a previous work [21] that
makes it possible to check the conservation or the loss of technical properties
in an architecture with category theory. We also intend to
explore further the capabilities of the maintenance functor, which
could be used to control the models of the dataset and the metadata
depending on their original models with the (d_model, d_model)
object of the product of categories. It can serve to detect model
transformations that will lose precision compared to the original
model.


metadata       metadata
JSON           md_model
file_system    d_system
described_by, d_modeled_by, d_stored_in, md_modeled_by, md_stored_in

Table 3: Functors used in the example, between the category


ABSTRACT


The novelty of the paper is that we introduce extended relational 
schemas defined by context-free languages. This allows us to spec- 
ify schemas that include nested relations. The representation of 
regular expressions with the graph made it possible to specify both 
relational algebraic operations and dependencies. This paper uses 
extended context-free languages to define schema graphs, con- 
structed from the regular expressions specifying the right side of 
the production rules. This is exploited by representing context-free 
grammars with recursive, increasing graphs. Next, we introduce 
the use of a context-free language to specify extended relational 
schemas with ordinary and nested attributes. In the derivation rules 
of the grammar, terminal symbols give ordinary attributes, while 
non-terminal (linguistic) symbols give nested attributes. Regular 
expressions can use both terminal and linguistic symbols. We can 
define schemas with finite derivations of rules. In the accepted sym- 
bol sequences, the terminal symbols give the ordinary attributes 
and the language symbols the nested attributes. For a nested at- 
tribute we must assign a schema with the appropriate rule. No 
embedded attribute may remain at the end. The relational row type 
and the finite occurrence of rows of this type can be defined as 
instance for the resulting schema. We define set operations and 
nested database operations as well. Functional dependencies can 
be defined and examined at multiple depths.


1 INTRODUCTION


XML has practically become the standard format for data exchange
over the World Wide Web. In XML, you can specify a structure
type by a formal language. Instances of a particular type of XML
Element can be considered a data collection. The declaration
of a DTD Element consisting of simple values can be considered
as a general relational schema definition. An instance from the
extended schema is a row type from the language given by the
regular expression. Occurrences of an Element corresponding to a
row type of a relation make a relation instance. Such an extension
is included in [2], [3].

This can be extended to complex Element declarations given 
by a context-free grammar. For this, we use two types of Element 
names in the DTD. Simple type, like in the regular case, can only 
take simple values. The other is the complex type, for which the 
regular expression of the declaration specifies a choice of row type, 
where simple and complex element names can be included in the 
allowed row type. The set of Element declarations in the DTD thus 
represents a context-free language, where simple names are the 
terminal symbols and complex names are the non-terminals. For 
each complex element we can associate a language built in parallel. 
A sentence from these languages consists of a list of simple element 
names and specifies a row type. 
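
The derivation of row types from such declarations can be sketched as follows. The grammar below is hypothetical, and two simplifying assumptions are made: the right-hand sides are given as explicit lists of alternatives rather than as regular expressions, and the grammar is non-recursive, so that the enumeration terminates.

```python
from itertools import product

# Hypothetical DTD-like declarations: each complex element (non-terminal)
# lists its alternative content sequences; any other name is a simple
# element (terminal).
GRAMMAR = {
    "person": [["name", "address"], ["name", "phone", "address"]],
    "address": [["city", "street"], ["city"]],
}

def row_types(symbol):
    """All sentences (tuples of simple element names) derivable from symbol."""
    if symbol not in GRAMMAR:
        return [(symbol,)]
    sentences = []
    for alternative in GRAMMAR[symbol]:
        expansions = [row_types(part) for part in alternative]
        for combo in product(*expansions):
            sentences.append(tuple(name for piece in combo for name in piece))
    return sentences

for row_type in row_types("person"):
    print(row_type)
```

Each printed tuple contains only simple element names and can therefore be read as one admissible row type of the complex element `person`.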

The use of the resulting regular schemas is illustrated by the 
graph of the finite-state automaton that can be given directly
from a regular grammar, or as it is given in [3], by graphs that 
can be directly assigned to regular expressions. This formal lan- 
guage method, which specifies the schema system, can be applied 
to context-free languages. The first option seemed to be to deter- 
mine context-free languages with graphs of recursive finite state 
automata. This is a less usable representation than the direct use of 
graphs of regular expressions in the version of context-free gram- 
mars given by regular expressions. We build on this idea when 
introducing nested relational extended schemas based on the use 
of context-free languages. 

IDEAS’22, August 22-24, 2022, Budapest, Hungary


XML was originally defined for describing and presenting individual
documents, but it has been used for building databases
too. Because XML is used as a database model, one needs XML
integrity constraints and XML functional dependency concepts.
The main problem with defining functional dependency in the XML
context is the lack of a "tuple" concept for XML. An instance of a
relational schema is a set of tuples, and one can easily select pairs
of tuples from this set and compare them in order to check whether
the instance satisfies a given functional dependency defined on
the relational schema. In the XML world, there is no generally accepted
definition for the concept of tuple, and even if one chooses
a collection of elements and declares them a "tuple", it is very hard
to find a proper matching algorithm for them. Arenas and Libkin
defined "tree tuples" in their seminal work [4], based upon DTD
schema. Vincent et al. [5] described some cases not covered by
"tree tuples", and invented the notion of "closest node" to deal with
them. They defined functional dependency on XML trees without
any schema, and used DTDs just to prove that their definition is
equivalent to "tree tuples" for some classes of DTDs.

All XFD concepts are very intricate compared with the classical
functional dependency concept for relational databases. In the XML
data model, they are based mostly on path expressions.

A new functional dependency concept, the regular FD, has been
proposed recently [2]. It is applicable to data models whose extended
"tuples" are sentences of a regular language. The main motivation
was to find a simple but general definition of functional dependency
for a broad family of data models: the only assumption was that
the "tuples" should be sentences of a given regular language (i.e.,
they should be generated by a regular grammar).

In this paper, we carry the concept of regular relational schema
over to the schema graph that can be constructed for extended
context-free grammars. The schema graph can be used to describe
instances of relational databases with complex values, and some
algebraic operations as well. Using the schema graph we can define
scoped functional dependencies on extended context-free languages.


2 GRAPH REPRESENTATION FOR FORMAL 
LANGUAGES 


We want to generalize the definition of a relation for a broad family
of data models: the only assumption is that the "tuples" should be
sentences of a given formal language (i.e., they should be generated
by a grammar). Our more general model uses formal languages to
specify "tuple" types; these schemas are given as sentences of a
formal language. To realize this aim, it is helpful to create a graph
representation for formal languages. A regular language can be
represented by the graph of the corresponding (deterministic or
non-deterministic) finite state automaton (FSA). Alg. 1 in [2]
creates this automaton from the grammar of the language. If the
regular language is given by a regular expression, then there exist a
great number of algorithms for the efficient construction of a finite
automaton from it. There are two main types, according to the working
method of the resulting state machine: non-deterministic (NFA, e.g.
the Glushkov automaton) and deterministic (DFA, e.g. Brzozowski's
construction). The classical algorithm of Berry and Sethi [6]
efficiently constructs a DFA from a regular expression when all
symbols are distinct. Here we use another algorithm to construct a
graph representation from a regular expression [3].

A graph representation of a regular language is an edge- or
node-labeled directed graph with one entry point and one exit
point. Routes from the entry point to the exit point are called
traversals. The label sequences of the traversals make up the
language.

For regular languages, we can obtain a graph representation
in two ways. For a language given by a regular grammar we can
directly construct the corresponding finite state automaton. In the
other case, where the language is given by a regular expression, we
can construct an edge-labeled graph directly from the regular
expression. One such construction is the graph of [6]; another is the
graph in [1].


2.1 Graph Representation for Regular 
Expressions 


Definition 1. (Regular Expression Syntax). Let $\Sigma$ be a finite set of
symbols (alphabet). A regular expression RE over $\Sigma$ (denoted by
$RE_\Sigma$, or simply RE if $\Sigma$ is understood from the context) is
recursively defined as follows:

$RE ::= 0 \mid 1 \mid a \mid RE + RE \mid RE \circ RE \mid RE^{*} \mid RE^{?}$, where $a \in \Sigma$.

The regular expression RE generates the regular language L(RE).
L(0) is the empty language, and L(1) is the language consisting of the
empty string $\varepsilon$ alone. Note that 0 and 1 are not symbols of the
alphabet $\Sigma$: 0 represents the empty regular expression, and 1
represents the empty string $\varepsilon$.

We need a construction for the graph representation of regular
expressions. We construct a graph from vertices picked from
a suitably large symbol set $\Gamma$. We assume that IN, OUT $\notin \Gamma$ and
that by picking a node $v \in \Gamma$ we remove it from $\Gamma$. The vertices IN
and OUT get the labels IN and OUT, respectively.


Algorithm 1 Construction of the graph representation for a regular
expression according to Def. 1.

Input: regular expression RE (built from the alphabet $\Sigma$),

Output: vertex-labeled digraph G(RE)=(V,E) representing RE.

if RE=0, then $V=\emptyset$ and $E=\emptyset$.

if RE=1, then V={IN,OUT} and E={(IN,OUT)}.

if RE=A, $A \in \Sigma$, then we pick a node $v \in \Gamma$, set V={IN,OUT,v} and
E={(IN,v), (v,OUT)}. We label the node v with A.

if $RE_1$ and $RE_2$ are regular expressions, then $G(RE_1+RE_2)$ is
formed by uniting the IN and OUT nodes of $G(RE_1)$ and $G(RE_2)$,
respectively.

if $RE_1$ and $RE_2$ are regular expressions, then in order to build the
graph $G(RE_1 \circ RE_2)$ we first rename the OUT node of $G(RE_1)$ and
the IN node of $G(RE_2)$ to JOIN, then unite them, using the JOIN
node as a connecting switch in order to get a more compact graph
(Fig. 1).

if RE is a regular expression and G(RE)=(V,E), then
$G(RE^{?})=(V, E \cup \{(IN,OUT)\})$.

if RE is a regular expression, then in order to build the graph
$G(RE^{*})$ we first pick a node $v \in \Gamma$, then we create the graph
$G^{*}(RE)=G(RE) \cup \{v\}$ (that is, $V^{*}=V \cup \{v\}$; the node v gets the
special label STAR). Let $\{a_1,\dots,a_n\}$ denote the nodes with an ingoing
edge from IN and $\{z_1,\dots,z_n\}$ the nodes with an outgoing edge to OUT,
respectively. Let us create the graph
$G_{IN}(RE,STAR)=\bigcup_{i=1}^{n}(v,a_i)$ and the graph
$G_{OUT}(RE,STAR)=\bigcup_{i=1}^{n}(z_i,v)$, respectively. Then
$G(RE^{*})=G^{*}(RE) \cup G_{IN}(RE,STAR) \cup G_{OUT}(RE,STAR) \cup (IN,STAR) \cup (STAR,OUT)$.
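
The construction of Alg. 1 can be tried out with a short Python sketch. This is only an illustration under our own assumptions, not the authors' implementation: regular expressions are given as nested tuples (`"one"`, `"sym"`, `"alt"`, `"cat"`, `"opt"`, `"star"`), alphabet symbols are single characters, fresh integers play the role of the nodes picked from the symbol set, and the case RE=0 is omitted. The helper names `build` and `traversals` are hypothetical.

```python
import itertools

_fresh = itertools.count()           # stands in for picking unused nodes
SPECIAL = {"IN", "OUT", "JOIN", "STAR"}

def _merge(nodes, edges, old, new):
    """Unite node `old` with node `new`: drop `old`, redirect its edges."""
    del nodes[old]
    return {(new if u == old else u, new if v == old else v) for u, v in edges}

def build(re):
    """Return (nodes, edges, IN, OUT) for a regex AST (case RE=0 omitted)."""
    kind = re[0]
    if kind in ("one", "sym"):
        i, o = next(_fresh), next(_fresh)
        nodes = {i: "IN", o: "OUT"}
        if kind == "one":
            return nodes, {(i, o)}, i, o
        v = next(_fresh)
        nodes[v] = re[1]                       # label the node with the symbol
        return nodes, {(i, v), (v, o)}, i, o
    if kind in ("alt", "cat"):
        n1, e1, i1, o1 = build(re[1])
        n2, e2, i2, o2 = build(re[2])
        nodes, edges = {**n1, **n2}, e1 | e2
        if kind == "alt":                      # unite both IN and both OUT nodes
            edges = _merge(nodes, edges, i2, i1)
            edges = _merge(nodes, edges, o2, o1)
            return nodes, edges, i1, o1
        nodes[o1] = "JOIN"                     # OUT of the first graph becomes JOIN
        edges = _merge(nodes, edges, i2, o1)   # ... united with IN of the second
        return nodes, edges, i1, o2
    if kind not in ("opt", "star"):
        raise ValueError(f"unsupported node: {kind}")
    nodes, edges, i, o = build(re[1])
    if kind == "opt":
        return nodes, edges | {(i, o)}, i, o
    star = next(_fresh)
    nodes[star] = "STAR"
    firsts = {v for u, v in edges if u == i}   # nodes entered from IN
    lasts = {u for u, v in edges if v == o}    # nodes leaving to OUT
    edges = edges | {(star, a) for a in firsts} | {(z, star) for z in lasts}
    return nodes, edges | {(i, star), (star, o)}, i, o

def traversals(graph, max_len):
    """Label sequences of all IN..OUT traversals with at most max_len symbols."""
    nodes, edges, inn, out = graph
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    words, todo, seen = set(), [(inn, ())], {(inn, ())}
    while todo:
        node, word = todo.pop()
        if node == out:
            words.add("".join(word))
        for nxt in succ.get(node, []):
            label = nodes[nxt]
            new = word if label in SPECIAL else word + (label,)
            if len(new) <= max_len and (nxt, new) not in seen:
                seen.add((nxt, new))
                todo.append((nxt, new))
    return words

sym = lambda s: ("sym", s)
R = ("cat", ("cat", sym("a"), sym("b")), ("star", ("alt", sym("A"), sym("B"))))
print(sorted(traversals(build(R), 3)))   # ['ab', 'abA', 'abB']
```

The bounded enumeration only inspects traversals up to a given length, so it can illustrate the statement that the traversals generate L(RE) on small cases, but it does not prove it.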


Theorem 1. ([1], [3])

The (IN, ..., OUT) traversals on the graph representation G(RE) that
Alg. 1 constructs for the regular expression RE generate exactly
the regular language L(RE) over the alphabet $\Sigma$ of RE.

Figure 1: Using the JOIN node as connecting switch for the concatenation of two RE graphs


2.2 Graph Representation for Context-free 
Languages by RSM Extension 


In the case of context-free languages, specifying a graph
representation is a more complex task. It is not sufficient to specify
a single graph; we may need several graphs that call each other
recursively. Such a representation can be found in [7]: the graphs of
the so-called recursive finite state machine. An interesting use of
RSM representations can be found in [8], where the authors use graph
operations and matrix algorithms to evaluate context-free path
queries.

Next, we introduce recursive state machines (RSM) from [8]. 

This kind of computational machine extends the definition of 
finite state machines and increases the computational capabilities 
of this formalism. 

A recursive state machine R over a finite alphabet $\Sigma$ is defined
as a tuple of elements $(M, m, \{C_i\}_{i \in M})$ where:

- $M$ is a finite set of labels of boxes.

- $m \in M$ is an initial box label.

- $\{C_i\}_{i \in M}$ is a set of component state machines or boxes, where
$C_i = (\Sigma \cup M, Q_i, q_i^0, F_i, \delta_i)$:

- $\Sigma \cup M$ is a set of symbols, $\Sigma \cap M = \emptyset$

- $Q_i$ is a finite set of states, where $Q_i \cap Q_j = \emptyset$, $\forall i \ne j$

- $q_i^0$ is an initial state for the component state machine $C_i$

- $F_i$ is a set of final states for $C_i$, where $F_i \subseteq Q_i$

- $\delta_i$ is a transition function for $C_i$, where $\delta_i : Q_i \times (\Sigma \cup M) \to Q_i$

An RSM behaves as a set of finite state machines (FSMs). Each
FSM is called a box or a component state machine [7]. A box works
almost the same as a classical FSM, but it also handles additional
recursive calls and employs an implicit call stack to call one
component from another and then return the execution flow back. RSMs
are equivalent in expressive power to context-free grammars.
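
How a box calls another box and returns can be sketched in Python. The following is a minimal illustration under our own assumptions, not code from [7] or [8]: the RSM is stored in a plain dictionary, the example machine `ANBN` (a single box S for the language a^n b^n, n >= 1) is hypothetical, and the call stack depth is capped at the input length plus one, which is enough when every call consumes at least one symbol before the next nested call.

```python
def accepts(rsm, word):
    """Search over configurations (state, call stack, input position)."""
    boxes, start = rsm["boxes"], rsm["start"]
    finals = set().union(*(b["final"] for b in boxes.values()))
    delta = {}
    for b in boxes.values():          # state sets are disjoint, so merging is safe
        delta.update(b["delta"])
    todo = [(boxes[start]["init"], (), 0)]
    seen = set(todo)
    while todo:
        q, stack, pos = todo.pop()
        if pos == len(word) and not stack and q in boxes[start]["final"]:
            return True
        nxt = []
        if pos < len(word) and (q, word[pos]) in delta:
            nxt.append((delta[(q, word[pos])], stack, pos + 1))   # read a symbol
        for label, box in boxes.items():
            if (q, label) in delta and len(stack) <= len(word):
                # call box `label`; remember where to continue afterwards
                nxt.append((box["init"], stack + (delta[(q, label)],), pos))
        if q in finals and stack:
            nxt.append((stack[-1], stack[:-1], pos))              # return from a box
        for conf in nxt:
            if conf not in seen:
                seen.add(conf)
                todo.append(conf)
    return False

# Hypothetical one-box RSM for S -> a S b | a b
ANBN = {
    "start": "S",
    "boxes": {
        "S": {
            "init": "q0",
            "final": {"q3"},
            "delta": {("q0", "a"): "q1", ("q1", "S"): "q2",
                      ("q1", "b"): "q3", ("q2", "b"): "q3"},
        },
    },
}

print(accepts(ANBN, "aabb"), accepts(ANBN, "aab"))
```

The transition labelled S in the box is what distinguishes the machine from an ordinary FSM: taking it pushes the return state q2 and restarts the box at q0.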

A version of the RSM construction is given in the following.
To do this, we start by introducing a special form of context-free
grammars. Each non-terminal symbol is assigned a single regular
grammar whose terminal symbols are the terminal and non-terminal
symbols of the original grammar.

Definition 2.

G = (N, T, S, P) is a context-free grammar given by regular
languages, where T is the set of terminal symbols, N is the finite
set of non-terminal symbols, P is the set of production (derivation)
rules, and S is the sentence symbol. The right sides of the production
rules are given by regular languages: for all $p \in P$,
$p = \{A \Rightarrow w \mid w \in L(G_A)\}$, where $A \in N$ and
$G_A = (N \cup T, N_A, P_A, S_A)$ is a regular grammar.

During the construction of the graph representations for each
non-terminal symbol, we build the finite state automata by
expansion iteration. We get graphs whose edges are labeled by
elements of T and N. The restriction of a graph F to the terminals
is denoted by $^{T}F$.

First, for each $A \in N$, we construct a finite-state automaton
$M_A$ for its rules, with states $Start_A$ and $End_A$ for entering
and leaving, respectively.

Let us start with the graph $M_S$.

The first sub-language, $L(^{T}M_S)$, is obtained by erasing all edges
with non-terminal labels and taking the language generated by
the remaining automaton. If this language is not empty, then each
of its sentences can be produced directly from S, so it is an element
of the language L(G).

Iteration step: We have constructed the graph $F_i$ and the
corresponding language $L(^{T}F_i)$. Choose an edge (p, q) in $F_i$
with a label $A \in N$. Insert the graph $p \to M_A \to q$ parallel
to the A edge, where $M_A$ is the graph of the last extended automaton
of A. Do this for all occurrences of edges with label A. This gives
the graph $F_{i+1}$. The new language is $L(^{T}F_{i+1})$. Obviously,
$L(^{T}F_{i+1})$ contains $L(^{T}F_i)$. The vertex labels of the graphs
are defined by the non-terminal symbols of the regular grammars in the
rules. Repeated use of vertex labels is not a problem when pasting.
In this way, the language L(G) can be produced as an expanding series
of regular languages generated by extending the graph $M_S$. The
procedure is not deterministic, but it can be made deterministic by
always inserting the new $M_A$ graphs in a cyclic order of N.

For each regular grammar $G_A$, we can build the extensions of the
RSM graph to produce the corresponding language over the terminals
T. These languages are denoted by $L(M_A)$.

We can generate regular expressions equivalent to the grammars $G_A$;
we denote them by $E_A(N \cup T)$. By writing the language $L(M_A)$
in the place of A in a regular expression, we get a regular expression
over the languages. Formally, by using the mapping $f(A)=L(M_A)$,
we get the expression $E(f(N) \cup T)$ over the languages
$\{L(M_A) \mid A \in N\}$ from the expression $E(N \cup T)$. The
notation f(N) means that any element of the language $L(M_A)$ can be
substituted for an occurrence of A in the regular expression.

Theorem 2. The languages $L(G)=L(M_S)$ and $\{L(M_A) \mid A \in N\}$
satisfy the following system of equations:

$L(M_A) = E_A(f(N) \cup T)$, for all $A \in N$.

Proof.

Take a specific rule $A \Rightarrow w \in L(G_A)$, where w = vBz, and
suppose that there exists $x \in L(M_B)$ such that $vxz \in L(M_A)$.
There is a traversal of the graph $M_A$ according to w, including the
edge B with a non-terminal label. Any graph that can be obtained
during the extension of the graph $M_B$ can be written in place of
the edge B. This means that any sentence of the language $L(M_B)$ can
be inserted here as a path, i.e. the whole language $L(M_B)$ can be
inserted in place of B.
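
As an illustration of the system of equations (our own worked instance, not part of the proof), assume three non-terminals R, A, B whose rules are equivalent to the regular expressions $E_R = ab(A+B)^{*}$, $E_A = cB^{*}$ and $E_B = dA^{*}$, with R as the sentence symbol. Substituting $f(X)=L(M_X)$ for each non-terminal X gives:

```latex
\begin{align*}
L(M_R) &= \{a\}\{b\}\,\bigl(L(M_A) \cup L(M_B)\bigr)^{*} \\
L(M_A) &= \{c\}\,L(M_B)^{*} \\
L(M_B) &= \{d\}\,L(M_A)^{*}
\end{align*}
```

For example, $c \in L(M_A)$ by taking zero copies of $L(M_B)$; hence $dc \in L(M_B)$, and therefore $abcdc \in L(M_R)$.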


2.3 Graph Representation for Extended 
Context-free Languages 


There are other ways to arrive at infinite graphs: we can specify the
set of rules for the individual non-terminals by regular expressions
instead of regular grammars. In Alg. 2 we use the graph
constructed by Alg. 1 for the regular expression. The appropriate
graph can be inserted in place of a non-terminal vertex, which yields
new, longer traversals. All the traversals through the infinite
graph obtained by infinitely extending the graph of the sentence
symbol again give the sentences of the language.

An extended context-free language (ECFL) is generated by
an extended context-free grammar (ECFG). An ECFG is a tuple
G=(N,T,R,P), where

N is the (finite) set of non-terminal symbols,

T is the (finite) set of terminal symbols, and $N \cap T = \emptyset$,

P is the set of production rules of the form $A \Rightarrow R_A$, where
$A \in N$ and $R_A$ is a regular expression over $N \cup T$,

$R \in N$ is the start symbol.

For each production rule $A \Rightarrow R_A$, the regular expression $R_A$
denotes a regular language $L_A \subseteq (N \cup T)^{*}$; the corresponding
graph representation can be constructed according to Alg. 1, using
$\Sigma = N \cup T$ as the input alphabet. During derivation by the grammar
G we substitute the non-terminal A by a sentence from $L_A$, so the
vertex labels set by Alg. 1 are picked from either N or T. To show
the path that leads to the attributes of a schema, we use the list of
non-terminal symbols used during the derivation leading to a terminal
symbol. In the first step we use the start symbol R as the beginning.
To create a graph for the extended context-free language L(G)
generated by the grammar G, we repeat the construction for all
vertices labeled with a symbol $X \in N$.


Algorithm 2 Construction of the infinite graph for an extended
context-free language.

Input: an ECFG G=(N,T,R,P),

Output: vertex-labeled digraph SCH(G)=(V,E) representing the
graph for G.

Step 1. We apply Alg. 1 to the RE $R_R$, using $N \cup T$ as the alphabet
$\Sigma$ and the vertex label R.Y instead of the symbol Y ($Y \in N \cup T$). We
denote the created graph by $G_1$. If all vertex labels in $G_1$ have a
terminal symbol as the last component, then the construction is
finished and SCH(G)=$G_1$. If there are vertices in $G_1$ with a
non-terminal symbol as the last component of their label, then these
vertices are bracketed: into the ingoing edge of the vertex we insert
a node with label in, and into the outgoing edge a node with label
out. These in and out labels do not have dotted form (see Fig. 2).

Step k ($k \ge 2$). If all vertex labels in $G_{k-1}$ have a terminal
symbol as the last component (ordinary attributes), then the
construction is finished and SCH(G)=$G_{k-1}$. Otherwise, we pick a
vertex whose label ends in a non-terminal symbol, say Z. This vertex
was (in, out) bracketed during Step k-1. We apply Alg. 1 to the RE
$R_Z$, using the vertex label R...Z.Y instead of the symbol Y
($Y \in N \cup T$). We select from the created graph a subgraph that
consists of complete (IN, ..., OUT) paths. Let this subgraph be $G^{Z}$.
The vertices in $G^{Z}$ with a non-terminal symbol as the last
component of their label are bracketed as described in Step 1 (see
Fig. 3 and Fig. 5). Let the vertex v in $G_{k-1}$ be the picked one,
labeled with the non-terminal Z. Let $(v_{in}, v)$ and $(v, v_{out})$ be
the ingoing and outgoing edges of v ($v_{in}$ has the label in, $v_{out}$
has the label out); then we unite the node IN of $G^{Z}$ with the node
$v_{in}$ and the node OUT of $G^{Z}$ with the node $v_{out}$. The united
vertices have the labels in and out, respectively (see Fig. 4 and
Fig. 6). Let the united graph be $G_k$.


The sub process finishes for a given non-terminal when all vertex- 
labels terminate in terminal symbols. The aim of the sub process 
is to construct a finite substitution tree. All leaf nodes need to be 
terminals. 

Remark 1. 

During processing, Alg. 1. can insert the sub graphs of all current 
nested attributes (if any) in a single step (Fig. 5. and Fig. 6). 


3 NESTED RELATIONAL SCHEMA GRAPHS, 
RELATIONAL INSTANCES AND 
OPERATIONS 


3.1 Nested Relational Schema Graphs and 
Schema Instances 


We extend the definition of the schema graph for regular expressions [1]
to extended context-free languages. Using the notation of the ECFG
from Section 2.3, we associate ordinary attribute names with the
terminals T and nested attribute names with the non-terminals N. The
extended language EL(G) consists of all paths of the graph generated
by Alg. 2.

Figure 2: Current generated graph in construction of a schema graph (P={R=>(ab(A+B)*), A=>(cB*), B=>(dA*)})

Figure 3: Sub graph for a nested attribute in construction of a schema graph

Figure 4: Next generated graph in construction of a schema graph


For each production rule $A \Rightarrow R_A$, the regular expression $R_A$
denotes a regular language $L_A \subseteq (N \cup T)^{*}$; the corresponding
graph representation can be constructed according to Alg. 1, using
$\Sigma = N \cup T$ as the input alphabet. During derivation by the grammar
G we substitute the non-terminal A by a sentence from $L_A$, so the
vertex labels set by Alg. 1 are picked from either N or T. To show
the path that leads to the attributes of a schema, we use the dotted
list of the non-terminal symbols used during the derivation leading to
a terminal symbol. The finishing (terminal) symbol is an ordinary
attribute of the relation; the preceding (non-terminal) symbols are
the nested attributes. In the first step we use the start symbol R as
the beginning (0-th) nested attribute. To create a schema graph for
the extended context-free language L(G) generated by the grammar G,
we repeat the construction for all vertices labeled with a symbol
$X \in N$. This series of steps represents a given derivation choice of
schemas. We can select either ordinary or nested schemas.

Figure 5: Sub graphs for all current nested attributes in construction of a schema graph


Algorithm 3 Selection of a Schema from the Schema graph

Input: vertex-labeled digraph SCH(G)=(V,E) representing the
schema graph for an ECFG G=(N,T,R,P),
Output: a schema that may contain nested relational attributes, but
whose deepest schemas (leaf schemas) contain exclusively ordinary
attributes.

Step 1.
We select a traversal from SCH(G), which may contain
non-terminal symbols. The non-terminals are the nested
attributes, the terminals the ordinary attributes. In an instance of
the schema, the ordinary attributes represent simple values and the
nested attributes take complex values (Def. 6).

Step 2.
For each non-terminal we select a traversal from its graph, which
gives the schema of the nested relation.

Step 3.
For all non-terminals in each relational schema we repeat Step 2.

Step 4.
The process finishes when all nested relational schemas contain
exclusively terminals.

We call a schema generated by Alg. 3 a legal schema.
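
The selection can be mimicked in Python without building the schema graph. The sketch below is an assumption-laden shortcut, not the algorithm itself: it uses the production rules P={R=>(ab(A+B)*), A=>(cB*), B=>(dA*)} shown with the figures, writes each rule as a Python regular expression over one-character symbols, and represents a schema as a hypothetical nested pair `(non_terminal, children)`. A schema is accepted when, at every level, the string of child symbols is a sentence of the rule of its non-terminal.

```python
import re

# Rules of the example grammar as Python regexes ('+' of the paper becomes '|').
RULES = {"R": "ab(A|B)*", "A": "cB*", "B": "dA*"}

def is_legal(schema):
    """schema = (non_terminal, children); a child is a terminal or a nested schema."""
    name, children = schema
    sentence = "".join(c if isinstance(c, str) else c[0] for c in children)
    if not re.fullmatch(RULES[name], sentence):
        return False
    return all(is_legal(c) for c in children if not isinstance(c, str))

def dotted(schema, prefix=""):
    """Flatten a nested schema into dotted attribute names (root name omitted)."""
    _, children = schema
    names = []
    for c in children:
        if isinstance(c, str):
            names.append(prefix + c)          # ordinary attribute
        else:
            names.extend(dotted(c, prefix + c[0] + "."))   # nested attribute
    return names

# A in R expands to c B B; the second B expands to d A A; each inner A to c.
NESTED = ("R", ["a", "b",
                ("A", ["c",
                       ("B", ["d"]),
                       ("B", ["d", ("A", ["c"]), ("A", ["c"])])])])

print(is_legal(NESTED))
print(dotted(NESTED))
```

Because every nested attribute carries its own child list, a schema accepted by `is_legal` always bottoms out in terminals, which corresponds to the stopping condition of Step 4.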


Example 1. Let G=({R,A,B},{a,b,c,d},R,P) be an extended context-free
grammar, where P={R=>(ab(A+B)*), A=>(cB*), B=>(dA*)}.
Among the possible generated (ordinary and nested) schemas are the
following:

R(abc), R(abd), R(abA[cd]), R(abA[cB[d]B[dcc]])

R(abc) gives the ordinary schema R(a,b,c), in dotted form
R(a,b,A.c).

The longest of these, R(abA[cB[d]B[dcc]]), generates the ordinary
schema R(a,b,c,d,d,c,c), in dotted form R(a,b,A.c,A.B.d,A.B.d,A.B.A.c,
A.B.A.c).

See the resulting graphs in Figures 2-7.

Example 2. Schema graph for the ECFG given in Example 1.

Let G=({R,A,B},{a,b,c,d},R,P) be an extended context-free grammar;
the production rules are given in Fig. 8. Fig. 8 presents a schema
graph for G. Two sentences of the ECFL given by G are
represented in Fig. 8:

(R.a, R.b, R.A.c, R.A.B.d)

(R.a, R.b, R.B.d, R.B.A.c)

The sequence of intermediate tuple-types:

Unnested case:

(R.a, R.b, R.A), (R.a, R.b, R.A.c, R.A.B), (R.a, R.b, R.A.c, R.A.B.d)

(R.a, R.b, R.B), (R.a, R.b, R.B.d, R.B.A), (R.a, R.b, R.B.d, R.B.A.c)

Nested case for all nested attributes with {} set notation:

(R.a, R.b, R.A), (R.a, R.b, R.A.{c, B}), (R.a, R.b, R.A.{c, B.{d}})

(R.a, R.b, R.B), (R.a, R.b, R.B.{d, A}), (R.a, R.b, R.B.{d, A.{c}})
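
The unnested sequence of intermediate tuple-types can be reproduced step by step. The following Python sketch is only an illustration; the function name `expand` and the list representation of a tuple-type are our assumptions. Each step replaces one nested attribute by the dotted symbols of a sentence chosen for it.

```python
def expand(tuple_type, attr, sentence):
    """Replace the nested attribute `attr` by the dotted symbols of `sentence`."""
    result = []
    for name in tuple_type:
        if name == attr:
            result.extend(f"{attr}.{symbol}" for symbol in sentence)
        else:
            result.append(name)
    return result

t0 = ["R.a", "R.b", "R.A"]        # sentence abA of the rule for R
t1 = expand(t0, "R.A", "cB")      # A => cB
t2 = expand(t1, "R.A.B", "d")     # B => d
print(t1)
print(t2)
```

Choosing the sentence abB for R and then dA and c for the nested attributes gives the second sequence in the same way.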


3.2 Instances of Complex Relations with 
Context-free Schemas 


Instances of relational databases with complex values consist of
a finite number of complex relational schemas and a finite set
of values of the sort given by each schema (see [9], Chapter 20).
Following this, we define the instance of an extended relation with
context-free schemas.

For the given CFG we can suppose, without loss of generality,
that the language L(A) associated with any non-terminal symbol A is
not empty. Under this assumption, any traversal generated by Alg.
2 is a legal schema instance. Legal schema instances are traversals
accepted by Alg. 3. Let SCH(G) denote the set of schemas given by
the language L(G).

A schema instance from SCH(G) specifies a tuple-type. For a
non-terminal symbol B in this sort, a tuple-type is given from SCH(B).
The associated value in a relation instance can be a simple tuple
(unnested case) or a table, i.e. a set of tuples of this type (nested
case).

Definition 3. A relation instance of the context-free schema
SCH(G) is given by the pair (R,I), where R is a finite subset of
SCH(G), and I is the set of complex valued relation instances for
each element of R. For $r \in R$ the corresponding instance is denoted
by I(r).

Example 3. 

Let G be the CFG from Example 1,
G={R=>(ab(A+B)*), A=>(cB*), B=>(dA*)}. An associated XML
DTD fragment:

<!ELEMENT TABLE (R*)>
<!ELEMENT R (a,b,(A|B)*)>
<!ELEMENT A (c,B*)>
<!ELEMENT B (d,A*)>
<!ELEMENT a (#PCDATA)>
<!ELEMENT b (#PCDATA)>
<!ELEMENT c (#PCDATA)>
<!ELEMENT d (#PCDATA)>

We give fragments of an XML instance representing the longest schema
from Example 1, R(abA[cB[d]B[dcc]]). It generates the ordinary
schema R(a,b,c,d,d,c,c), in dotted form R(a,b,A.c,A.B.d,A.B.d,A.B.A.c,
A.B.A.c).
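
The dotted form can be read off an XML instance mechanically. The Python sketch below uses the standard `xml.etree.ElementTree` module and is only an illustration: the sample document is our own variant of an instance of the longest schema in which the inner c elements are wrapped in A elements, as the content model of B in the DTD requires, and the function name `dotted_values` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed sample instance; the values 1..7 are arbitrary.
XML = """<R> <a>1</a> <b>2</b>
<A> <c>3</c>
<B> <d>4</d> </B>
<B> <d>5</d> <A> <c>6</c> </A> <A> <c>7</c> </A> </B>
</A>
</R>"""

def dotted_values(elem, prefix=""):
    """Return (dotted attribute name, value) pairs in document order."""
    pairs = []
    for child in elem:
        if len(child) == 0:                      # leaf element: ordinary attribute
            pairs.append((prefix + child.tag, (child.text or "").strip()))
        else:                                    # inner element: nested attribute
            pairs.extend(dotted_values(child, prefix + child.tag + "."))
    return pairs

root = ET.fromstring(XML)
for name, value in dotted_values(root):
    print(name, value)
```

The root tag is left out of the names, which matches the dotted form R(a,b,A.c,...) used in Example 1.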

Figure 6: Next generated graph in construction of a schema graph when inserting all current sub graphs

Figure 7: Schema graph for the ECFG in Example 1.

Case of no nested type, the corresponding XML fragment:


<R> <a>1</a> <b>2</b> 
<A> <c>3</c> 
<B> <d>4</d> </B> 
<B> <d>5</d> <c>6</c> <c>7</c> </B> 
</A> 
</R> 


Case of nested type, R(abA[cB[d]B[dcc]]), where the bold B specifies
a nested relation; the corresponding XML fragment:


<R> <a>1</a> <b>2</b>
<A> <c>3</c>
<B> <d>4</d> </B> <B> <d>8</d> </B> <B> <d>9</d> </B>
<B> <d>5</d> <c>6</c> <c>7</c> </B>
</A>
</R>


3.3 Operations on Complex Relations 


Next we give some hints for defining algebraic operators over SCH(G)
instances. The details are left to the reader.

Set operations are defined over two instances, $(R_1, I_1)$ and
$(R_2, I_2)$. First, we take the operation (union, intersection,
difference) over the schema sets $R_1$ and $R_2$. Then, for each pair
$r_1 \in R_1$ and $r_2 \in R_2$ where $r_1$ and $r_2$ are of the same sort, we
use the standard relational set operation.

Cross product can be defined over instances of two schemas
generated by context-free grammars $G_1$ and $G_2$. Using the standard
notation, SCH($G_1$) × SCH($G_2$) = SCH($G_1G_2$), where $G_1G_2$ is the
grammar of the concatenated languages. At the instance level, for
$(R_1, I_1)$ and $(R_2, I_2)$ from SCH($G_1$) and SCH($G_2$), respectively,
the cross product is taken for each pair $r_1 \in R_1$ and $r_2 \in R_2$.

The operations most specific to nested relations are the pair Nest
and Un-nest. These operations are defined on two instances $r_1 \in R$
and $r_2 \in R$. The traversals on the schema graph are the same
for $r_1$ and $r_2$, but the choice between the nested and the simple
version of traversing the graph of a non-terminal B is different. See
the exact definition of these operations in [9], Chapter 20.2.

Usually, a projection operator results in an instance of a sort that
does not belong to SCH(G). It is possible to define a special
projection operation if we allow some non-terminals to include the
empty sentence. This means that a shortcut when traversing a
non-terminal vertex is possible. So we can project out an
attribute of this type and the result remains inside SCH(G).

The situation with the natural join is similar: the result does not
remain in SCH(G). It is possible to define a special language
operator which associates a context-free language with the
corresponding natural join operator on the instances of SCH(G). The
common attribute list of the instances $r_1 \in R$ and $r_2 \in R$ is given
by the longest common initial path of the traversals. Let the
traversals be xy and xz. The result of the natural join is of the
sort given by xyz. In this way we have defined a new product-like
operator on a schema graph, the self-natural join graph.
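
The sort xyz of the join can be computed directly from two traversals. The Python sketch below is a minimal illustration under our own assumptions: traversals are lists of dotted attribute names, instances are lists of value tuples aligned with their traversal, and tuples are matched by equality on the common initial part, as in the standard natural join. The names `join_sort` and `natural_join` are hypothetical.

```python
def join_sort(t1, t2):
    """Split two traversals xy and xz into (x, y, z), x the longest common prefix."""
    k = 0
    while k < min(len(t1), len(t2)) and t1[k] == t2[k]:
        k += 1
    return t1[:k], t1[k:], t2[k:]

def natural_join(r1, r2, k):
    """Join two instances on their first k positions (the common part x)."""
    return [u + v[k:] for u in r1 for v in r2 if u[:k] == v[:k]]

x, y, z = join_sort(["R.a", "R.b", "R.A.c", "R.A.B.d"],
                    ["R.a", "R.b", "R.B.d", "R.B.A.c"])
print(x + y + z)                       # sort of the join result

r1 = [(1, 2, 3, 4), (1, 5, 6, 7)]      # instance of sort xy
r2 = [(1, 2, 8, 9)]                    # instance of sort xz
print(natural_join(r1, r2, len(x)))
```

Only the first tuple of `r1` agrees with the tuple of `r2` on the common attributes R.a and R.b, so the join contains a single tuple of sort xyz.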


4 FUNCTIONAL AND KEY DEPENDENCIES 
ON SCHEMAS FOR FORMAL LANGUAGES 


4.1 FD on Extended Context-free Languages 
with Restricted Grammar 


Functional and key dependencies on ECFGs have been defined in
[2]. Those definitions used a restricted ECFG allowing production
rules of two forms only:

1. $A \Rightarrow R_A$, where $A \in N$ and $R_A$ is a regular expression over N,

2. $A \Rightarrow u$, where $A \in N$, $u \in T$.

Moreover, the two sets of non-terminals that occur as LHS in group
1 and in group 2 were disjoint.

With these restrictions, the dependency definitions for regular
languages could be applied to ECFGs too. Notice that the schemas
used were ordinary: they contained no nested attributes. In
the following we use ECFLs/ECFGs without restrictions 1 and 2
above, which were applied in [2].


4.2 FD on Extended Context-free Languages 
using Nested Attributes 


Definition 4. (Extended Relation for Extended Context-free Language).
Let L be an ECFL and let G=(N,T,R,P) be its generating
grammar. We say that the set of terminal symbols T is a set of
attribute names. Let $w = w_1 \dots w_n \in L$ be a sentence; then we say
that w is an extended relational tuple type over T. Let $dom_u$,
$u \in T$, be sets of data values; then
$\{(w_1{:}a_1, \dots, w_n{:}a_n) \mid a_i \in dom_{w_i}\}$ is the set
of possible tuples of type w. A finite subset of these tuples is an
instance of the extended relation. We say that the set of the tuple
types for all $w \in L$ composes the schema of an extended relation
based on L. Each tuple type corresponds to an (IN, ..., OUT) traversal
on the schema graph SCH(G). The tuple types of all tuples in an
instance compose the schema of the instance.

Let L be an ECFL and let G=(N,T,R,P) be its generating grammar.
To specify the left and right sides of a functional
dependency we pick two sets of attributes (either ordinary ones,
Def. 4, or nested ones), one set for the left side (denoted by
X) and another for the right side (denoted by Y). Using a schema
from SCH(G) we can select ordinary attributes for both sides of an
FD, and we can define syntax and semantics for this FD similarly to
the case of regular FDs [2]. In the following, we deal with scoped
FDs on ECFLs using nested attributes.

We can choose nodes visited by a traversal and state that each
visit of these nodes is selected. We can also choose starting
and ending points for a path in the traversal, so that this pair of
nodes is selected each time that path is closed.

Let G=(N,T,S,P) be an ECFG. For each production rule $A \Rightarrow R_A$,
the regular expression $R_A$ denotes a regular language
$L_A \subseteq (N \cup T)^{*}$; the corresponding graph representation
(denoted by $G(R_A)$) can be constructed according to Alg. 1. During
derivation by the grammar G we get the (not necessarily regular)
language generated from A: we denote by $L(A) \subseteq T^{*}$ the strings
derived from A by the rules in P.

Remark 2.

We can assume without loss of generality that each non-terminal
symbol occurs once as LHS in a production rule, because the rules
$A \Rightarrow R_A^1$ and $A \Rightarrow R_A^2$ can be replaced with the rule
$A \Rightarrow (R_A^1 + R_A^2)$, so that the obtained grammar is equivalent
to the original one [10].

Let $t \in L(A)$ and let $U \in N$ be a non-terminal symbol that occurs in
$R_A$. We can interpret U as an attribute of A. U, as a start symbol,
is the root of the ECFG $G_U=(N,T,U,P)$, so when generating t by G, some
sentences of L(U) are generated along the way; let these sentences
be $u_1,\dots,u_k$ ($k \ge 1$). Then
$t=\omega_1 u_1 \dots \omega_k u_k \omega_{k+1}$, where $\omega_i \in T^{*}$,
$1 \le i \le k+1$. We interpret the projection of t onto U as
$t[U]=u_1 \dots u_k$.

Definition 5. Let G=(N,T,S,P) be an ECFG and let
$\alpha \in (N \cup T)^{*}$ be a symbol string; then the language
$L(\alpha) \subseteq T^{*}$ is the set of all strings that can be derived
from $\alpha$ by applying the production rules in P to the non-terminals
and leaving each terminal symbol in its place. Formally, let
$\alpha=\alpha_1\alpha_2\dots\alpha_n$, $\alpha_i \in N \cup T$, $1 \le i \le n$; then

$L(\alpha)=\{\omega \in T^{*} \mid \omega=\omega_1\omega_2\dots\omega_n\}$, so that either
$\alpha_i \Rightarrow^{+} \omega_i$, where $\Rightarrow^{+}$ is the transitive closure
of the derivation, when $\alpha_i \in N$, or $\omega_i=\alpha_i$, when
$\alpha_i \in T$.

According to Def. 5 we can assign values (taken from a non-empty
domain set D) to the terminal symbols. Let $u \in T$ be a terminal
symbol; then let $D_u \subseteq D$ be a set such that for $u, v \in T$,
$u \ne v$ implies $D_u \cap D_v = \emptyset$.

The mapping $val: u \in T \to D_u$ assigns a domain value to a
terminal symbol. When the assignment is made for a string
of terminal symbols, each assignment occurs independently,
that is, $val(u_1) \ne val(u_2)$ can occur even when $u_1 = u_2$. Obviously,
when $u_1 \ne u_2$ then $val(u_1) \ne val(u_2)$.

For $\omega \in L(A)$, let $\omega=\{u_1u_2\dots u_n\}$; then
$val(\omega)=\{v_1v_2\dots v_n\}$ is a valuation of $\omega$, where
$v_i=val(u_i)$, $1 \le i \le n$.

Definition 6. (Complex valued tuple) [9]

Let G=(N,T,S,P) be an ECFG, let $A \in N$ be a non-terminal symbol
and let $\alpha \in L_A$, $\alpha=(A_1A_2\dots A_n)$, $A_i \in N \cup T$
($1 \le i \le n$) be a string of either non-terminal or terminal
symbols. Let $\beta=(B_1,\dots,B_k)$ ($0 \le k \le n$) be the string of
non-terminals in $\alpha$. We say that the $B_i$-s are nested attributes
of $\alpha$, which is a sentence format in the scope A. Let
$\gamma=(C_1,\dots,C_l)$ ($0 \le l \le n$) be the string of terminals in
$\alpha$. Obviously, $k+l=n$. Let $\omega \in L(\alpha)$,
$\omega=\{\omega_1\omega_2\dots\omega_n\}$, so that either $\omega_i \in L(A_i)$ when
$A_i \in N$, or $\omega_i=A_i$ when $A_i \in T$ ($1 \le i \le n$). A valuation
of $\omega$ is a complex valued tuple t of A, denoted by
$t_\alpha=val_\alpha(\omega)$.

Regular functional dependencies are presented in [2]; they are
syntactically defined on the graph of the accepting FSA of a regular
language, and their semantics is given on sentences of the
language. We extend this definition to ECFGs with nested attributes
so that the syntax of the FDs is defined on a single regular
expression (using one production step only), but for the semantics we
use the whole derivation tree of the ECFG. With this restriction we
can still handle most real-life applications, that is, "horizontally"
connected data, and it allows a quadratic complexity of implication.

Definition 7. (Assignment)

Let L be an ECFL and let G=(N,T,R,P) be its generating grammar.
Let $A \in N$ be a non-terminal symbol and let $G(R_A)=(V,E)$ be the
graph representation of $R_A$ (Alg. 1). We say that the tuple
$Y=(Y_1,Y_2)$, where $Y_1 \subseteq V$ and $Y_2$ is a subgraph of the
transitive closure of $G(R_A)$, is an assignment on $G(R_A)$. $Y_1$ is
taken from the non-recurring part of $G(R_A)$; $Y_2$ refers to nodes and
edges that are (or could be) visited repeatedly during a traversal.

An assignment Y selects a unique subsequence from a
given sentence format as follows.

Let G=(N,T,S,P) be an ECFG, let $A \in N$ and let $G(R_A)$ be the
corresponding graph representation. Let $w=\{v_1,v_2,\dots,v_n\}$ be a
traversal on $G(R_A)$.

Definition 8. (Selection on Scope). Let $Y=(Y_1,Y_2)$ be an assignment
on $G(R_A)$ for the scope A and let w be a traversal on $G(R_A)$.
The symbols in $Y_1$ are selected in the order of their exploration
(when visited). For each edge $e \in Y_2$, when the edge is closed
on the shortest path between its endpoints during the traversal
w, these two endpoints are selected in their order of succession
(if visited at all). That is, if the two endpoints of the closing
path are A and B ($A=v_i$, $B=v_j$ for some $1 \le i < j \le n$), then
the path selected is the one that contains neither A nor B between
its endpoints. The nodes in $Y_2$ are selected at each visit (if any)
during the traversal w. The selection is processed for all
edges and nodes in $Y_2$ independently. At the end of the selection,
the symbols selected from w build up the (possibly empty) array
$w[Y]=(v_{i_1},\dots,v_{i_k})$, $1 \le i_1 < i_2 < \dots < i_k \le n$ ($k \ge 0$).

Let $w$ be a traversing on $G(R_A)$, let $\omega \in L(w)$, and let
$t=val(\omega)$ be a tuple of $A$. We interpret the sequence of symbols
$w[Y]$ as a set of "attributes" that projects the tuple $t$ to the values
$t[Y]=val(\omega[Y])$. That is, let $w=(v_1,v_2,\dots,v_n)$, let
$w[Y]=(v_{i_1},\dots,v_{i_k})$ $(1 \le i_1 < i_2 < \dots < i_k \le n,\ k \ge 0)$ and let
$\omega = \{\omega_1 \omega_2 \dots \omega_n\}$; then
$t[Y]=val(\omega_{i_1})\,val(\omega_{i_2}) \dots val(\omega_{i_k})$.

If $w[Y]=\{\}$, then $t[Y]=\{\}$ as well.

Concerning the regular language $L_A$ we can define a functional
dependency over $G(R_A)$, considering the non-terminal $A$ as the
scope for the functional dependency (Def. 8).

Definition 9. (Scoped Functional Dependency) 

Let $A$ be a scope in the ECFG $G$ and let $G(R_A)$ be the
corresponding graph representation. Let $X=(X_1,X_2)$ and $Y=(Y_1,Y_2)$ be
two assignments (Def. 7) over $R_A$. A functional dependency
defined over $G(R_A)$ ($FD_A$) is an expression of the form $X \to Y$. The
(finite) database instance $R$ of $A$ satisfies the functional
dependency $X \to Y$ (denoted by $R \models X \to Y$) if, for any two tuples
$t_1,t_2 \in R$, $t_1[X]=t_2[X]$ implies $t_1[Y]=t_2[Y]$.
We call the case $Y=G(R_A)$ a key dependency.

Example 4. The example XML instance (Fig. 8) describes data
of students participating in courses. The corresponding DTD contains
the following declarations:

<!ELEMENT Courses (Course+)>
<!ELEMENT Course (Cid,Std+)>
<!ELEMENT Std ((Stid,Stn,Stl)+)>

Based upon this declaration we can define the following two
scoped functional dependencies (Def. 9):

Scope Course: ({Cid},{}) → ({},{Stid,Stn,Stl}) : key dependency

Scope Std: ({Stn},{}) → ({Stl},{})
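The satisfaction condition of Definition 9 can be illustrated with a minimal sketch. It assumes a strong simplification: each tuple is flattened to a dictionary of attribute values, and an assignment is reduced to a plain list of attribute names. The function name satisfies_fd and the sample values are hypothetical and serve only as illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def satisfies_fd(rows: Iterable[Row], lhs: List[str], rhs: List[str]) -> bool:
    """Return True if any two rows that agree on lhs also agree on rhs."""
    seen: Dict[Tuple[str, ...], Tuple[str, ...]] = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The first value seen for a key is remembered; a different later value violates the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical flattened Std tuples; the values are illustrative only.
students = [
    {"Stid": "s1", "Stn": "Ann", "Stl": "BSc"},
    {"Stid": "s2", "Stn": "Bob", "Stl": "MSc"},
    {"Stid": "s3", "Stn": "Ann", "Stl": "BSc"},
]
fd_holds = satisfies_fd(students, ["Stn"], ["Stl"])
conflicting = students + [{"Stid": "s4", "Stn": "Bob", "Stl": "PhD"}]
fd_violated = satisfies_fd(conflicting, ["Stn"], ["Stl"])
```

The check runs in a single pass over the instance, since it only has to remember one right-hand side per left-hand side value.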


5 CONCLUSION AND FUTURE WORKS 


This paper presents schema graphs for extended context-free lan- 
guages, based on the graph representation for the regular expres- 
sions and defines functional and key dependencies and set opera- 
tions over them. 

Our model offers the tools for a normal form of extended relations
on ECFG, but specifying such a normal form is a hard problem,
because the set operations on schemas can lead outside the class
of context-free languages: the intersection of two ECFLs is not
necessarily context-free. As future work, we could try to find a way
to specify a normal form for extended relations on ECFG. Another
possible continuation of our work is to describe join dependencies
for ECFG.


ABSTRACT


Generating synthetic data similar to realistic data is a crucial task 
in data augmentation and data production. Due to the preservation 
of authentic data distribution, synthetic data provide concealment 
of sensitive information and therefore enable Big Data acquisition 
for model training without facing privacy challenges. Nevertheless,
obstacles arise at every stage, from acquiring real-world open-source
data to synthesizing new samples that are as genuine as possible.
In this paper, a comparative study is conducted by considering the 
efficacy of different generative models like Generative Adversarial 
Network (GAN), Variational Autoencoder (VAE), Synthetic Minority 
Oversampling Technique (SMOTE), Data Synthesizer (DS), Synthetic 
Data Vault with Gaussian Copula (SDV-G), Conditional Generative 
Adversarial Networks (SDV-GAN), and SynthPop Non-Parametric 
(SP-NP) approach to synthesize data with regard to various datasets. 
We use pairwise correlation and Synthetic Data (SD) metrics
as utility measures between real data and generated data for
evaluation. In this way, the paper investigates the effects
of various data generation models, and the processing time of every
model is included as one of the evaluation metrics.


1 INTRODUCTION


Synthetic data is artificially generated information, as opposed to
data recorded from real-world events via direct measurement [7].
Artificial data is used when real data is not available or cannot be
used due to privacy concerns, and it avoids leaving business data
vulnerable to data breaches.

Using real data for purposes like algorithm testing, Machine
Learning training or various Data Science applications in business
and industry suffers from several problems: original data is often
highly secured, takes a lot of time to become accessible, and cannot
be used for testing hypothetical scenarios. Therefore, generating
synthetic data is important for researchers and business developers
to overcome the restrictions on real data usage. It allows conditions
not yet encountered to be simulated, and it can be generated to meet
specific needs or conditions that are not available in existing (real)
data. Another familiar constraint is the lack of specific
characteristics in a dataset that are required for certain applications
or domains and that typically cannot be obtained effortlessly and
economically.

In practice, scarce authentic data often serves as a template from
which synthetic data can be produced algorithmically, e.g. when
privacy requirements limit data usage and variability. Typical
applications where synthetic data is a "must-have" are autonomous
driving, financial services, healthcare, model training, and consumer
behaviour evaluation in marketing and social media analysis.

Research communities and organizations involved in Machine
Learning development need adequate datasets with various
characteristics for experimental purposes. Large, accessible datasets
free of privacy concerns are therefore much desired. Nevertheless, in
many cases real data is not sufficient to form a dataset that meets the
demands of the intended experiments and approaches, due to the numerous
concerns mentioned above. Missing values or corrupted records caused by
errors in measurement, data encryption or storage, and data that is too
expensive to acquire due to technological constraints or consent
requirements, are among the most common contributing reasons. Synthetic
data hence becomes a promising alternative to alleviate these
limitations and opens up opportunities in a wide range of domains
such as privacy protection, image generation, healthcare, data mining
and pattern recognition.

In this paper, we present a comparative study in order to identify
the most efficient data generation method for certain use cases.
Seven models were inspected, namely Generative Adversarial Network
(GAN), Variational Autoencoder (VAE), Synthetic Minority Oversampling
Technique (SMOTE), Data Synthesizer (DS), Synthetic Data Vault with
Gaussian Copula (SDV-G), Conditional Generative Adversarial Networks
(SDV-GAN), and SynthPop Non-Parametric (SP-NP). For each use case,
identical configurations such as working environment, hardware,
training datasets from open sources (Adult Census Data, Airbnb Data,
and Airlines Data) and programs are applied to every model to ensure
fairness in the performance evaluation, which considers processing
time, generation speed and generated data quality. To the best of our
knowledge, many of these models have been widely applied in
contemporary work, but no study has yet examined their efficacy in
comparison; this is the objective of our contribution.

This paper is structured as follows: Section 2 describes the data
generation models we used, the pipeline architecture, and the
datasets. Details on the implementation strategies and challenges
can be found in Section 3. All experiments and results are presented
in Section 4. In Section 5 we discuss related work, and Section 6
contains a summary.


2 BACKGROUND 


In this section we provide an outline of our synthetic data
generation pipeline. We give a brief overview of our architecture, the
data preprocessing step, and the models and datasets used. We also
present details of our evaluation techniques.


2.1 Pipeline Architecture 


Our workflow is defined by four steps (Figure 1), starting with data
collection and preprocessing. The preprocessed data becomes the input
of the training phase, in which various features and characteristics
such as data types, value ranges, patterns and distributions are
extracted and captured.


Figure 1: Pipeline architecture (data collection and preprocessing,
model training with real data, synthetic data generation, evaluation)


Data quality requirements include the consistency, accuracy,
integrity, timeliness, interpretability, and believability of the
dataset. The raw data, which is the original form of the real data
intended as training data, most often comprises incomplete or
inconsistent records and lacks values or attributes, depending on the
type of dataset. Therefore, we transformed the raw dataset into a clean
and understandable dataset that can be used for training the generator
models [11]. To achieve this, we performed Data Collection, Data
Cleaning, and Data Analysis and Extraction. Data Collection involves
gathering and measuring the featured variables of the datasets,
enabling the target outcome for the next phases such as implementation
and evaluation. Data Cleaning is applied to check for and fill in
missing values, remove noisy data, detect or remove outliers, and
correct inconsistent data. Data Analysis and Extraction takes place
by reviewing the data, checking it for missing values, and
removing noisy data, if any.
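As a minimal sketch of the Data Cleaning step for a single numeric column, the following example fills missing values and removes outliers. The concrete rules (median imputation and the 1.5 x IQR fences) as well as the sample column are assumptions made for illustration; the text above does not state which rules were applied.

```python
from statistics import median, quantiles
from typing import List, Optional, Tuple


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (low, high) fences; values outside them are treated as outliers."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# Illustrative column with two missing values and one implausible entry.
ages = [25, 31, None, 47, 38, 29, 300, None, 52]
filled = fill_missing(ages)
low, high = iqr_bounds(filled)
cleaned = [v for v in filled if low <= v <= high]
```

Categorical columns would need a different treatment, for example imputing the most frequent label.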

Trained models can subsequently be used for inference tasks, where
synthetic data is generated based on user configurations passed to the
models. Such configurations can specify the desired number of samples
to be produced as well as value ranges and distributions. The data
distribution of the output schema (generated data) is intended to be
similar to that of the input schema (original data). Finally, the
evaluation is conducted on the generated data to assess its quality and
to compare model performance according to the evaluation metrics
described in Section 2.4.


2.2 Synthetic Data Generation Models 


The seven frameworks considered in this study, namely Synthetic
Minority Oversampling Technique (SMOTE), Generative Adversarial
Network (GAN), SynthPop Non-Parametric (SP-NP), Synthetic Data Vault
with Gaussian Copula (SDV-G), Variational Autoencoder (VAE), Data
Synthesizer (DS) and Conditional Generative Adversarial Networks
(SDV-GAN), are outlined in Table 1. A brief summary of each model is
given in Section 3, while detailed descriptions can be found in the
referenced literature.


Table 1: List of used models


Year | Name    | Full name                                   | Ref.
2002 | SMOTE   | Synthetic Minority Oversampling Technique   | [2]
2014 | GAN     | Generative Adversarial Network              | [6]
2015 | SP-NP   | SynthPop Non-Parametric                     | [13]
2016 | SDV-G   | Synthetic Data Vault with Gaussian Copula   | [14]
2017 | DS      | Data Synthesizer                            | [16]
2017 | VAE     | Variational Autoencoder                     | [8]
2019 | SDV-GAN | Conditional Generative Adversarial Networks | [17]


2.3 Datasets


For our experiments we used datasets with varying numbers of records.
First, the Adult Census Data¹ (30,162 records with 11 attributes
after data preprocessing; originally 48,842 records with 14
attributes) is census data from 1994 based on income, a multivariate
dataset consisting of both categorical and numerical data. Second, the
Airbnb Data² (213,451 records with 11 attributes after data
preprocessing; initially the same number of records with 16
attributes) holds details of new Airbnb users with demographics, web
records and other statistics, and its attributes comprise categorical,
numerical and timeseries data. Third, the Airlines Dataset³ (1,046,595
records after data preprocessing; 1,048,575 records with 30 attributes
before) holds flight travel details of adult passengers, with
attributes consisting of categorical and numerical data. We chose
these datasets because their different attribute types and varying
data distributions help reveal the advantages and disadvantages
of each model when handling them.


2.4 Evaluation Techniques 


This section describes the two techniques used to evaluate the
generated datasets of the compared models on the basis of their
efficacy and utility: Pairwise Correlation and SD Metrics.

1 UCI Machine Learning Repository (2019): https://archive.ics.uci.edu/ml/datasets/census+income

2 Airbnb: https://www.kaggle.com/competitions/airbnb-recruiting-new-user-bookings/overview

3 AirlinesCodrnaAdult: https://www.openml.org/d/1240


2.4.1 Pairwise Correlation and Proximity Value. Pairwise correlation
is the correlation distance observed between two variables.
Correlation distance is a well-known method of estimating the distance
between two random variables with finite variances. When the
distance between two items is zero, they are equivalent,
and vice versa [1]. The method evaluates the correlation of real
data and synthetic data. The concept behind this technique is to
determine whether the relationships between the variables in the real
data are preserved in the generated synthetic data, which is done as
follows:


(1) First, the pairwise correlations of the n variables of the real
data are observed and stored in a matrix df_real.corr.

(2) Then the pairwise correlations of the n variables of the synthetic
data are observed and stored in a matrix df_synth.corr.


By exploring the output values from data generation, we check
whether the pairwise correlation structure of the real data is closely
reflected by the correlation structure of the synthetic data. We
then determine the pairwise correlation distance for both real data
and synthetic data and calculate the difference between them. To
accomplish this, the correlation difference diff is calculated as
follows:

diff = df_real.corr - df_synth.corr    (1)

For a generated dataset of good quality, diff should be close to '0',
which is the proximity value. The proximity values of bad quality
are far from '0'. Proximity is the measure of similarity and
dissimilarity between the data points. The proximity value referred to
in this case means the value obtained after observing the correlation
distance between real data and synthetic data [5].
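Equation (1) can be sketched in plain Python as follows. With pandas, the two matrices would typically be obtained from DataFrame.corr(). Reducing diff to a single proximity value by averaging the absolute entries is an assumption of this sketch, since the text does not state how the matrix is aggregated. All names and sample values are hypothetical.

```python
from math import sqrt
from typing import Dict, List

Table = Dict[str, List[float]]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def corr_matrix(table: Table) -> List[List[float]]:
    """Pairwise correlation matrix over all columns of the table."""
    cols = list(table)
    return [[pearson(table[a], table[b]) for b in cols] for a in cols]


def proximity(real: Table, synth: Table) -> float:
    """Mean absolute entry of diff = corr(real) - corr(synth); assumed aggregation."""
    r, s = corr_matrix(real), corr_matrix(synth)
    diffs = [abs(r[i][j] - s[i][j]) for i in range(len(r)) for j in range(len(r))]
    return sum(diffs) / len(diffs)


# Illustrative columns only; not taken from the evaluated datasets.
real = {"age": [23, 35, 47, 51, 62], "hours": [20, 38, 40, 45, 50]}
synth_good = {"age": [25, 33, 45, 54, 60], "hours": [22, 36, 41, 44, 52]}
synth_bad = {"age": [25, 33, 45, 54, 60], "hours": [52, 44, 41, 36, 22]}

p_good = proximity(real, synth_good)  # correlation structure preserved
p_bad = proximity(real, synth_bad)    # correlation sign flipped
```

A constant column would lead to a division by zero in pearson; the sketch leaves this case unhandled.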
2.4.2 SD Metrics. SD Metrics were developed under the SDV project⁴
[14] both to evaluate the relationship of synthetic data with real
data and to test the quality of the former. It is a dataset-agnostic
tool that supports various data modalities such as single column,
column pairs, single table, multi table and timeseries. It also
includes a variety of metric families such as statistical, detection,
efficacy, privacy, Bayesian Network and Gaussian Mixture. These
comprise several individual metrics such as CSTest (Chi-Squared),
KSTest (Inverted Kolmogorov-Smirnov D statistic), KSTestExtended,
LogisticDetection (Logistic Regression Detection), SVCDetection
(Support Vector Classification Detection), BNLikelihood (Bayesian
Network Likelihood), BNLogLikelihood (Bayesian Network Log
Likelihood), LogisticParentChildDetection (Logistic Regression
Detection), and SVCParentChildDetection (Support Vector
Classification Detection). Here, values close to '0' mean that the data
is not of good quality, and values close to '1' mean the data is of
good quality. SD Metrics is available as a library in Python.
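To illustrate how one such score lands in the range from 0 to 1, the sketch below computes an inverted two-sample Kolmogorov-Smirnov D statistic for a single numeric column. It is a standalone illustration and not the SD Metrics API; the function names and sample values are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS D statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The supremum of the gap is attained at one of the sample points.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def inverted_ks(real: Sequence[float], synth: Sequence[float]) -> float:
    """1 - D, so that 1 means identical empirical distributions."""
    return 1.0 - ks_statistic(real, synth)


real_col = [1, 2, 3, 4, 5]
score_same = inverted_ks(real_col, [1, 2, 3, 4, 5])
score_shifted = inverted_ks(real_col, [3, 4, 5, 6, 7])
score_disjoint = inverted_ks(real_col, [11, 12, 13, 14, 15])
```

The three scores decrease as the synthetic column moves away from the real one, which matches the reading that values near '1' indicate good quality.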


3 IMPLEMENTATION DETAILS AND CHALLENGES


In this section we present more details on the models used and
discuss implementation issues and challenges we had to solve in
order to compare all models.


3.1 SMOTE 


SMOTE [2] is an over-sampling technique that generates data for
minority classes by taking a sample from each such class as input.
It then creates synthetic samples based on the sample's k
Nearest Neighbors (kNN) in the feature space. New samples are
generated by multiplying the difference between the input feature
vector and that of a selected nearest neighbor by a random value
between 0 and 1, and then adding the result to the input feature
vector. This makes the decision boundary of the minority classes
more general.


4 SD Metrics: https://github.com/sdv-dev/SDMetrics

Implementation. We adopted the SMOTE technique with Label
Encoding and Logistic Regression. Label Encoding converts the labels
(the data in each row, column or cell) into numeric format, making it
easier for the Machine Learning models to process the dataset, whereas
Logistic Regression helps in predicting categorically dependent
variables. To generate new (fake) values, SMOTE requires one or two
columns as a reference. This allows the data to be manipulated to
produce new random values. SMOTE is initialized for oversampling so
that the data can be transformed. This is done with a random state of
2 and by setting a sampling_strategy ratio that determines the number
of samples to be generated, because the data was earlier split into two
sets: one with the one or two reference columns and another that
includes all the features of all the columns.
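The interpolation step that SMOTE performs during oversampling can be sketched in plain Python. This is a minimal illustration and not the implementation used in the experiments; the function names and sample points are hypothetical, and the seed 2 and k=1 merely mirror the random state and neighbor count named in this section.

```python
import random
from math import dist
from typing import List, Sequence

Vector = Sequence[float]


def nearest_neighbors(sample: Vector, others: Sequence[Vector], k: int) -> List[Vector]:
    """Return the k points closest to sample (Euclidean), excluding sample itself."""
    candidates = [o for o in others if o is not sample]
    return sorted(candidates, key=lambda o: dist(sample, o))[:k]


def smote_sample(sample: Vector, minority: Sequence[Vector], k: int,
                 rng: random.Random) -> List[float]:
    """Create one synthetic point between sample and one of its k nearest neighbors."""
    neighbor = rng.choice(nearest_neighbors(sample, minority, k))
    gap = rng.random()  # random value in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(2)
# Illustrative minority-class points in a two-dimensional feature space.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 9.0]]
new_point = smote_sample(minority[0], minority, k=1, rng=rng)
```

With k=1 the nearest neighbor of [1.0, 1.0] is [2.0, 1.0], so the new point lies on the segment between these two points.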

Challenges. SMOTE faces certain challenges while generating
synthetic data. The main ones are overlapping classes and additional
noise, and it is hard to apply to very high-dimensional data. We
overcame this by analyzing the samples and their respective labels
from the results of Label Encoding. The hyperparameters were the same
for all 3 datasets, with random_state set to 2 and k_neighbours set
to 1. Tuning random_state and k_neighbours changed the resulting
values only minutely. Since SMOTE generally requires a sufficient
number of samples to compare during the oversampling phase, we must
ensure that there are sufficient samples (specifically, more than 2
labels) in the columns (particular column attributes) of the dataset.
This was done during data preprocessing, where we analyzed the data
and checked whether all attributes hold a sufficient number of samples
with more than 2 labels.


3.2 GAN 


Generative Adversarial Network (GAN) [6] is a Neural Network
(NN)-based approach consisting of two components: a generative
model G that learns the distribution of the training data and then
produces samples from it, and a discriminative model D that tries to
classify samples as real or fake. The objective of the training
process is to maximize the error of the discriminator D when
classifying samples produced by the generator G, thus inherently
maximizing the similarity of the synthesized samples to authentic
samples. Both models are trained in alternation: the generator
becomes more effective at producing realistic synthetic data,
while the discriminator becomes increasingly skilled at identifying
these data.
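For reference, the adversarial objective described above is commonly written as the following minimax game, as in the original GAN formulation [6]. The symbols $p_{data}$ (distribution of the real data) and $p_z$ (prior over the generator's noise input $z$) are not used elsewhere in this text and are introduced here only for the formula.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes the second term by producing samples that D classifies as real.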

Implementation. The implementation of GAN mainly uses Label
Encoding (as for SMOTE) and Clustering. Clustering divides the data
points into groups such that the data points of the same group are
similar to each other and dissimilar to those in other groups. The
batch size is set to 100 so that the whole set of data is handled in
batches of 100 on each iteration. Next, two variables are initialized
with the value 1, one for the discriminator and the other for the
generator. They denote the number of discriminator network updates
per adversarial training step and the number of generator network
updates per adversarial training step, respectively. These act as the
key hyperparameter arguments, which are the same for all datasets.
Then the label columns and the data columns are initialized from the
columns of the training data: the class column, if present, provides
the labels, and the training data without labels is initialized from
the remaining data columns.

Challenges. In general, it is difficult to train GANs as they have
several problems such as non-convergence and mode collapse.
Non-convergence occurs when the parameters are unstable and do not
converge. Mode collapse occurs when the generator collapses and
generates only a limited variety of samples. An imbalance between
generator and discriminator can cause overfitting, and GANs are
highly sensitive to hyperparameter selection. To overcome this, it is
important to balance generator and discriminator in order to avoid
overfitting during training.


3.3 SP-NP 


SynthPop Non-Parametric (SP-NP) [13] is a Python port of the
original R package SYNTHPOP, which uses the functions and structure
of the mice multiple imputation package and extends it for the
purpose of synthetic data generation. The basic functionality of
SYNTHPOP is to generate synthetic versions of microdata containing
sensitive information. It has two modes, namely parametric and
non-parametric. The parametric mode combines logistic regression with
linear regression; its default methods for binary, numeric, ordered
and unordered factor data types are designated in a vector and can be
customized if needed. The non-parametric mode uses Classification and
Regression Trees (CART) as its default method, which handles all types
of variables having predictors and can be applied to all data types.
In this paper, we use only the non-parametric version.

Implementation. The implementation of SP-NP involves Label
Encoding, as for SMOTE: the input data is encoded and passed to the
generator function. In general, SYNTHPOP requires dtypes (data types)
as one of its main hyperparameters apart from the feature contents.
It is therefore necessary to specify the data type of every column
attribute in dtypes for each dataset. The results obtained from the
generator are decoded back to the original input data structure and
serve as our synthetic data.
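The encode and decode round trip described above can be sketched as follows. This is a plain-Python illustration, similar in spirit to scikit-learn's LabelEncoder, and not the code used in the experiments; the function names and the sample column are hypothetical.

```python
from typing import Dict, List


def fit_label_encoder(values: List[str]) -> Dict[str, int]:
    """Assign consecutive integer codes to the sorted distinct labels."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values: List[str], mapping: Dict[str, int]) -> List[int]:
    """Replace every label by its integer code."""
    return [mapping[v] for v in values]


def decode(codes: List[int], mapping: Dict[str, int]) -> List[str]:
    """Map integer codes back to the original labels."""
    inverse = {code: label for label, code in mapping.items()}
    return [inverse[c] for c in codes]


# Illustrative categorical column.
workclass = ["Private", "State-gov", "Private", "Self-emp"]
mapping = fit_label_encoder(workclass)
codes = encode(workclass, mapping)
restored = decode(codes, mapping)
```

A generator that emits non-integer codes would require rounding to the nearest valid code before decoding; the sketch does not cover that case.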

Challenges. SYNTHPOP had to be customized for every single
dataset based on its attribute types. It also does not work well
with float and categorical data formats, which is why we use Label
Encoding to overcome this challenge.


3.4 SDV-G 


Synthetic Data Vault with Gaussian Copula (SDV-G) [14] is a
framework that generates synthetic data by utilizing a multivariate
model derived from the intersection of various tables in a relational
database. The whole database is modeled from input consisting of the
data tables themselves and their corresponding metadata. The
relationships between tables are computed recursively using
Conditional Parameter Aggregation (CPA) to handle foreign key
relations, while the relationships between the columns of a table are
computed using a multivariate Gaussian Copula to calculate covariance.
The framework is capable of model-based and knowledge-based synthesis:
the former allows the user to synthesize a complete database using
solely the computed model, while the latter allows the completion of
data based on some given information.
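The idea of modeling the relationship between two columns with a Gaussian Copula can be sketched in plain Python. The sketch assumes only two numeric columns, rank-based empirical marginals and no ties; it is not the SDV implementation, and all names and sample values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist
from typing import List, Sequence, Tuple

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(column: Sequence[float]) -> List[float]:
    """Map each value to a normal score via its rank-based empirical CDF."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        scores[index] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def empirical_quantile(column: Sequence[float], u: float) -> float:
    """Return the observed value at quantile level u (0 < u < 1)."""
    ordered = sorted(column)
    return ordered[min(int(u * len(ordered)), len(ordered) - 1)]


def sample_pairs(col_a: Sequence[float], col_b: Sequence[float], size: int,
                 rng: random.Random) -> List[Tuple[float, float]]:
    """Draw synthetic pairs whose dependence follows the fitted copula."""
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    residual = sqrt(max(0.0, 1.0 - rho * rho))
    pairs = []
    for _ in range(size):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1, z2 = e1, rho * e1 + residual * e2  # correlated standard normals
        pairs.append((empirical_quantile(col_a, STD_NORMAL.cdf(z1)),
                      empirical_quantile(col_b, STD_NORMAL.cdf(z2))))
    return pairs


# Illustrative columns only.
ages = [23, 35, 47, 51, 62, 29, 44]
hours = [20, 38, 40, 45, 50, 30, 41]
synthetic = sample_pairs(ages, hours, size=100, rng=random.Random(0))
```

Because the marginals are empirical, every synthetic value is one of the observed values; a fitted parametric marginal would be needed to generate unseen values.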


Implementation. The SDV framework has the advantage that it can be
applied directly to all types of datasets. Both the input and the
generated synthetic data are in the form of dataframes. It should also
be noted that Label Encoding can be used for all SDV models. SDV-G is
implemented directly by simply assigning the Gaussian Copula model
as the generator.


Challenges. In general, Gaussian Copula works only for linear
correlation problems and not when there is tail dependence. Tail
dependence describes the clustering of large events, which is
extremely important for risk management problems. So, for this
task SDV-G proves beneficial, but its suitability for clustering
problems is in question.


3.5 SDV-GAN 


Conditional Generative Adversarial Networks (SDV-GAN) [17] is
an adapted framework of SDV-G using Conditional GAN (CTGAN),
a GAN-based generator of synthetic tabular (categorical) data.
It can handle varied feature types for both discrete and continuous
data and is claimed to outperform existing GAN models, Variational
Autoencoder (VAE) and Bayesian Networks when applied to benchmarking
datasets. It also includes new techniques such as mode-specific
normalization to augment the training process, a change in
architecture, and a conditional generator with training by sampling
to resolve data imbalance. We call this model SDV-GAN since CTGAN is
also a part of the SDV-G framework.


Implementation. Since the SDV framework can be applied directly
to all types of datasets, with both the input data and the generated
synthetic data in the form of dataframes, the setup is similar to the
implementation of SDV-G. SDV-GAN is thus implemented directly by
simply assigning the CTGAN model as the generator.


Challenges. SDV models also have problems producing complete
results when there are too few samples for an attribute type. Larger
datasets with a sufficient amount of categorical data can overcome
this issue, and smaller datasets can in certain cases overcome it
when trained multiple times.


3.6 VAE 


VAE [8] is an NN-based approach capable of capturing complicated
data distributions and producing synthetic data that is similar to the
original data. It consists of two components: an encoder and a
decoder. The former captures dimensional dependencies by mapping
the input data into the latent space, while the latter takes the
latent variable as input to generate samples. Mathematically, VAE
is classified as a variational Bayesian method and thus differs
in formulation from autoencoders despite the architectural similarity.
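For reference, the variational Bayesian formulation mentioned above is usually expressed through the evidence lower bound (ELBO) that a VAE maximizes for each data point $x$. The symbols $q_\phi$ (encoder), $p_\theta$ (decoder) and $p(z)$ (prior over the latent variable $z$) do not appear elsewhere in this text and are introduced here only for the formula.

```latex
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The first term rewards accurate reconstruction of $x$ from the latent variable, while the Kullback-Leibler term keeps the encoder's distribution close to the prior, which is what allows new samples to be generated by decoding draws from $p(z)$.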

Implementation. The implementation of VAE is as difficult as that
of GAN, especially for the purpose of generating categorical data.
Here, we use Label Encoding just as for SMOTE, GAN and SP-NP.
Since VAEs already possess an encoder and a decoder, the generation
process takes more time than with any other model. However, the
Label Encoding cannot be avoided, because the encoder and decoder
functions produce some null values when applied directly.

Challenges. The main challenge of VAE is its processing time. In
future, we might be able to overcome this by running the
implementation in a GPU environment.


3.7 DS


Data Synthesizer (DS) [16] is a framework consisting of three
modules: a Data Describer, a Data Generator and a Model Inspector. The
Data Describer inspects the input data types, correlations and
attribute distributions to create a summary, from which synthetic
samples are generated by the Data Generator. The Model Inspector
is responsible for visualizing the summary, allowing accuracy
evaluation and parameter adjustment. DS has three modes: the
default correlated attribute mode, the independent attribute mode for
cases of expensive correlation computation or inadequate samples, and
the random mode for cases of exceedingly sensitive data.

Implementation. The DS framework is applied directly using the
random mode, as we need our data to be handled sensitively, but it has
problems generating timeseries data. We added the timeseries
generation by using the datetime Python library. The main
implementation is carried out by reading the dataset file as a whole
and writing its features into a JSON file, which acts as a medium for
generating synthetic data that has the same features as the real data.
This is one of the reasons why it is time efficient. The threshold
value is the hyperparameter of the Data Describer, which is the same
for all datasets; note that this threshold is always less than the
domain size when the attribute is categorical.
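The describe-then-generate flow through a JSON summary can be sketched as follows. The sketch follows the idea of an independent attribute mode, where each column is sampled from its own value histogram; it is not the DataSynthesizer API and does not reproduce the random mode used in our setup. All names and sample rows are hypothetical.

```python
import json
import random
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def describe(rows: List[Row]) -> Dict[str, Dict[str, int]]:
    """Summarize every column as a histogram of its observed values."""
    return {col: dict(Counter(row[col] for row in rows)) for col in list(rows[0])}


def generate(summary: Dict[str, Dict[str, int]], size: int,
             rng: random.Random) -> List[Row]:
    """Sample each column independently, weighted by its histogram."""
    return [
        {col: rng.choices(list(hist), weights=list(hist.values()))[0]
         for col, hist in summary.items()}
        for _ in range(size)
    ]


# Illustrative rows only.
real_rows = [
    {"sex": "F", "edu": "BSc"},
    {"sex": "M", "edu": "BSc"},
    {"sex": "F", "edu": "MSc"},
]
summary_json = json.dumps(describe(real_rows))  # the JSON "medium"
synthetic_rows = generate(json.loads(summary_json), size=5, rng=random.Random(0))
```

Since only the summary is needed at generation time, the real records never have to be passed to the generator, which is one reason this style of generation is fast.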

Challenges. The inability to generate timeseries is a major drawback
of the model. If this is compensated by additional timeseries
generation methods, however, the model is very beneficial for
synthetic data generation, as its time consumption is very low.


4 EXPERIMENTS AND RESULTS 


In this section, we present our comparative study and discuss the
experiments in detail by describing and presenting the results
obtained throughout the experimental phase⁵.


4.1 Experimental Setup 


Following the goals of this paper, we conducted four experiments
after successfully completing the implementation and evaluation. The
experiments are carried out on different datasets stored in
CSV files, namely Adult Census Data (30,162 records), Airbnb Data
(213,451 records) and Airlines Data (1,046,595 records). For more
details we refer to Section 2.3.

The environment for the experiments consists of Python 3.9,
Jupyter Notebook with Anaconda3, PyCharm IDE, and the required
libraries installed on an Ubuntu platform. The platform is a server
with Ubuntu 20.04, 754GB RAM and an Intel(R) Xeon(R) Gold processor
at 2.60GHz. Data preprocessing is done in Jupyter Notebook, while
synthetic data generation and evaluation are carried out in PyCharm.


4.2 Accumulated Results 


The results obtained from the experiments are shown in Tables 2 to 5
as well as in the form of heatmaps (Figures 2 to 4 in the
Appendix). The proximity and SD Metrics (cf. Section 2.4) are
evaluated for each column of synthetic data against each column
of real data. The models show good proximity, where '0' means
close proximity and any value far from '0' means that the
proximity is low. They also achieve good SD Metric scores, where
'1' means the generated synthetic data is of good quality, whereas
'0' means the quality of the data is low. The heatmaps represent the
pairwise correlation distance between each column attribute of the
synthetic data and each column attribute of the real data.


5 Source Code: https://github.com/ashamvenu/SyntheticData.git

We performed different experiments on different datasets as
described below:


(1) The first experiment generates and evaluates the Adult Census
Data with 10K records for the 7 models SMOTE, GAN, SP-NP, SDV-G,
SDV-GAN, VAE and DS.

(2) The second experiment uses the full Adult Census Data with
30,162 records, for which we ran the models SMOTE, GAN, SP-NP,
SDV-G, SDV-GAN and DS. We excluded VAE here because it had a
runtime of about 4 days and ended in a value error arising from
the floating-point matrix of the generated dataset.

(3) The third experiment uses the Airbnb Data (213,451 records),
which includes all necessary types of attributes, namely
categorical, numerical and timeseries. Here, we ran the models
SMOTE, SP-NP, SDV-G and DS, but not GAN, SDV-GAN and VAE, because
these models produce memory errors due to the huge number of records.

(4) Finally, the last experiment is conducted on the Airlines Data
(1,046,595 records), which possesses more than a million records,
for the models SMOTE, SP-NP, SDV-G and DS. As with the Airbnb
Data, we do not perform experiments on GAN, SDV-GAN and VAE.


—~ 
ss 
4.2.1 Results on Adult Data. In the first experiment, with 10K
records of Adult Census data, all seven models (SMOTE, GAN,
SP-NP, SDV-G, SDV-GAN, VAE and DS) ran successfully. Table 2
reports their proximity levels, SD Metrics and processing times
for comparison.


Table 2: Adult Census Data (10K records)

Model      Proximity Level   SD Metrics   Processing Time
GAN        0.0127            0.8564       26.2 minutes
VAE        0.1903            0.3704       21.2 hours
SMOTE      0.0055            0.7833       2.6 minutes
DS         0.0190            0.0967       0.4 seconds
SDV-G      0.0002            0.6434       7.0 seconds
SDV-GAN    0.0065            0.6830       3.8 minutes
SP-NP      0.0007            0.8595       2.6 minutes


Table 3 shows the results on the full Adult Census Data with
30,162 records, and the heatmaps (Appendix, Figure 2) represent
the pairwise correlation distance between synthetic data and real
data. SDV-G and SP-NP achieve the best proximity levels, while
SMOTE, GAN, SP-NP and SDV-GAN achieve the best SD Metrics
scores. In terms of processing time, DS and SDV-G outperform the
other models, although SMOTE and SP-NP still finish in a
reasonable time.

Table 3: Adult Census Data Results

Model      Proximity Level   SD Metrics   Processing Time
GAN        0.0402            0.8621       1.9 hours
SMOTE      0.0372            0.7296       8.9 minutes
DS         0.0219            0.0977       1.3 seconds
SDV-G      0.0029            0.6426       17.9 seconds
SDV-GAN    0.0100            0.7134       25.7 minutes
SP-NP      0.0054            0.8644       9.0 minutes
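
The rankings discussed for Table 3 can be reproduced with a few lines of Python. The sketch below is only an illustration: the values are copied from Table 3, while the helper name `to_seconds` and the dictionary layout are our own assumptions and not part of the published source code.

```python
# Values copied from Table 3 (full Adult Census Data, 30,162 records):
# model -> (proximity level, SD Metrics score, processing time as text)
table3 = {
    "GAN": (0.0402, 0.8621, "1.9 hours"),
    "SMOTE": (0.0372, 0.7296, "8.9 minutes"),
    "DS": (0.0219, 0.0977, "1.3 seconds"),
    "SDV-G": (0.0029, 0.6426, "17.9 seconds"),
    "SDV-GAN": (0.0100, 0.7134, "25.7 minutes"),
    "SP-NP": (0.0054, 0.8644, "9.0 minutes"),
}

SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600}


def to_seconds(text):
    """Convert a string such as '8.9 minutes' into seconds (hypothetical helper)."""
    value, unit = text.split()
    return float(value) * SECONDS_PER_UNIT[unit]


# Lower proximity is better, higher SD Metrics is better, lower time is better.
by_proximity = sorted(table3, key=lambda m: table3[m][0])
by_sd_metrics = sorted(table3, key=lambda m: table3[m][1], reverse=True)
by_time = sorted(table3, key=lambda m: to_seconds(table3[m][2]))

print("proximity: ", by_proximity)
print("SD Metrics:", by_sd_metrics)
print("time:      ", by_time)
```

Converting all processing times to seconds first makes the mixed units in the table directly comparable.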


4.2.2 Results on Airbnb Data. On the Airbnb Data, the GAN, VAE
and SDV-GAN models failed to run: the large number of data
samples caused a memory error that terminated the program.
Hence, this experiment was performed only on SMOTE, SP-NP,
SDV-G and DS. Table 4 reports the proximity level and processing
time of these models. SMOTE and SP-NP achieve better proximity
than DS and SDV-G, while DS is far faster than the other three
models. The heatmaps in the Appendix, Figure 3, show the
correlation distance between the real data and the synthetic data
for each column in the Airbnb Data. SD Metrics is not evaluated
for Airbnb Data: the SD Metrics framework is designed to evaluate
the dataframes of synthetic data and real data as a whole, which
results in a memory error on large datasets.


Table 4: Airbnb Data Results

Model      Proximity Level   Processing Time
SMOTE      0.0036            1.5 hours
DS         0.0093            33.6 seconds
SDV-G      0.0102            2.6 minutes
SP-NP      0.0008            1.6 hours


4.2.3 Results on Airlines Data. As with the Airbnb Data, the
experiments are carried out only on the SMOTE, SP-NP, SDV-G and
DS models. Table 5 reports the proximity level and processing
time of these models. SMOTE and SP-NP again achieve better
proximity than DS and SDV-G, and the processing time of DS is
again far better than that of the other three models. The heatmaps
in Figure 4 present the correlation distance between the real data
and the synthetic data for each column in the Airlines dataset. As
with the Airbnb Data, we did not evaluate SD Metrics due to
memory errors.


Table 5: Airlines Data Results

Model      Proximity Level   Processing Time
SMOTE      0.0148            7.7 hours
DS         0.0287            90.3 seconds
SDV-G      0.0282            11.9 minutes
SP-NP      0.0008            6.9 hours

5 RELATED WORK


Synthetic data generation has been a significant research topic for
the past two decades. However, most of the research has focused
on artificial image generation, and the focus on text data has been
rising only in recent years. There are many papers dealing with
synthetic data generation, and we are not able to mention all of
them. In this section, besides the models presented in Sections 2
and 3, we mention the work most relevant to our study.

In a comparative study by Dandekar et al. [4] of Linear
Regression, Decision Tree, Random Forest and Neural Networks,
the results show that NNs are more effective w.r.t. utility and
privacy on the basis of running time. However, the evaluation also
shows that the normalized Kullback-Leibler divergence scores are
more or less the same for all four models. Considering the
importance of flexible models for synthetic data generation that
capture the nuances of multivariate structures, Manrique-Vallier
and Hu [12] proposed a Bayesian non-parametric method that
preserves complex multivariate relationships between variables
subject to structured zeros. Their tool, the Truncated
non-Parametric Latent Class Model (TNPLCM), uses the Full
Conditional Specification (FCS) approach, the CART method and
Random Forests. It produces high-quality analytical data that
carries low risk. For this reason, we have used SP-NP in our
research, which uses the CART method for its non-parametric
version.

Peng and Telle [15] published a complete tool for synthetic data
generation that employs three algorithms, namely Random Data
Generation, Decision Tree and Multilinear Regression, for different
use cases depending on their respective data mining pattern. While
the tool is satisfactory in terms of software functionality, an
adequate evaluation of output quality and processing time was not
provided.

Besides textual data, synthetic time series generation is also in
demand, particularly in domains that involve sensor data analysis,
such as healthcare applications. Dahmen and Cook [3] introduced a
Machine Learning approach based on hidden Markov models and
regression models. The generated data were evaluated using time
series distance measures against authentic, manually annotated
smart-home data and were shown to be highly realistic as well as
capable of improving the accuracy of the model they were used to
train. However, the results report only accuracy, and we find the
scores unsatisfactory when the quality of the generated data is
compared with the real data.

Lecanzo and Arias [10] introduced three different generation
methods for traditional datasets based on item-based generative
models: two of them, the Itemset Generating Model (IGM) and the
Interesting Itemset Miner (IIM), operate over itemsets, and the
third, Latent Dirichlet Allocation (LDA), uses textual corpora. The
evaluation is based on characteristics (pattern similarity),
preservation of itemsets, privacy and runtime. It reveals the
strengths and weaknesses of each model: IGM has the lowest
learning-phase runtime, while IIM scores best in data generation.
Although the results report the runtime in seconds, the amount of
data used is unclear, which makes the evaluation hard to assess.

Leduc and Grislain [9] proposed an architecture called Composable
Generative Model (CGM), an auto-regressive model that feeds
column embeddings through a transformer. It is evaluated using
the Synthetic Data Gym (SDGym) benchmark, and the authors claim
CGM to be the best state-of-the-art approach. The concept of using
an encoder, decoder and loss function seems reliable, but the
evaluation scores do not justify the claims with respect to
synthetic data.


6 CONCLUSION AND DISCUSSION 


In this paper we considered the challenge of generating synthetic
data based on real data. The objective of our paper was to compare
several state-of-the-art approaches to synthesizing data in order to
determine the efficiency of the chosen models. We used different
real-world datasets as a basis for our data-driven tasks, namely
Adult Census data, Airbnb data, and Airlines data, to analyze and
depict the differences in performance and behaviour of all seven
models. The initial experiments show that SMOTE, SP-NP, SDV-G
and DS are the better choices among the 7 models in terms of
proximity, SD Metrics and processing time. GAN, SDV-GAN and
VAE cannot handle large datasets; hence we ran further
experiments on SMOTE, SP-NP, SDV-G and DS to analyze which of
these 4 models gives more promising results. From all the
experiments, we conclude that SMOTE and SP-NP are the most
effective methods in terms of proximity, and SDV-G and DS are the
most effective in terms of processing time.

Our next step will be to consider models for text data generation,
which is out of scope in this work because it is a more challenging
task. Unlike categorical data types, natural language generation
involves language modelling, a complicated and demanding process
that requires more sophisticated model architectures, enormous
amounts of data, robust hardware and longer training time. In
addition, evaluating the quality of generated text is an obstacle to
be addressed: identifying suitable metrics for automatic validation
can be much less straightforward and even counterintuitive, while
human evaluation is too expensive to be included.

ABSTRACT


Low Code is a technology that has been gaining popularity over the
years due to its potential and simplicity. So far, however, there has
been no experimental comparison with other programming methods.
This paper aims to introduce Low Code development technology
and compare it with Java Swing programming and manual
development with HTML (HyperText Markup Language), CSS
(Cascading Style Sheets), and JavaScript. These technologies are
compared using the following metrics: development time, execution
time, and the number of written code lines. For this evaluation, two
applications, a simple calculator and a text editor, are implemented
in all technologies. It is concluded that it is faster to develop
applications in Low Code, but in terms of execution time these
applications are usually slower. Although Low Code development is
still at a somewhat embryonic stage, which leads to some bugs and
errors, it is better in general than Java Swing programming and
somewhat similar to manual programming with HTML, CSS, and
JavaScript. Another benefit is that Low Code generates HTML and
CSS automatically.

1 INTRODUCTION


Low Code development is a technology that facilitates programming
by reducing the amount of handwritten code and allowing
non-programmers to build applications and have a more active
presence in the development process. Low Code is a collection of
tools that enables developers to avoid hand-coding and reduces the
development effort of getting an application ready for production.

There are many ways of developing an application, and users
sometimes have difficulty choosing the language and the type of
programming. In this work, Low Code development uses Neptune
Planet 9, a Low Code Development Platform (LCDP); Java Swing
programming uses the NetBeans IDE (Integrated Development
Environment); and manual development with HTML, CSS, and
JavaScript uses Visual Studio Code.

Java is one of the most recognized and used programming lan- 
guages, and Swing is a Java toolkit that allows programmers to eas- 
ily build a User Interface (UI) for their Java applications. JavaScript is 
a lightweight, interpreted, object-oriented programming language 
mostly used to program web pages. 

Java Swing and JavaScript are chosen for comparison with Low
Code for different reasons. Java Swing builds the UI in a similar
way but runs on the operating system, unlike Low Code
applications, which run in a browser. JavaScript is chosen because
it is the same type of programming that LCDPs use, but the UI
development is manual.

These technologies can use database systems as external ser- 
vices, where the communication with the database is made through 
an API (Application Programming Interface) or another commu- 
nication protocol, where a request is sent to the database and an 
answer is given in response. Depending on the database it is possi- 
ble to manage the database through these requests, facilitating the 
development work. 

In this paper, Low Code development, Java Swing programming, 
and manual development with HTML, CSS, and JavaScript are com- 
pared, by developing two applications in all types of programming, 
which has not been done before according to [1]. The platforms 
chosen to develop the apps are based on our experience with these 
types of platforms and because they are free of charge. The objective 
of this study is to help users to know what type of programming 
to choose through the comparative evaluation of the developed 
applications. 

The rest of this paper is structured as follows. Section 2
introduces background concepts and the methodology applied.
Section 3 exhibits the results. Section 4 presents the discussion of
the results. Section 5 describes the open problems. Finally, Section
6 presents the main conclusions and future work.

Figure 1: Low code app development process. Source: [6].


2 BACKGROUND 


This section introduces Low Code development, LCDPs, Java Swing, 
HTML, CSS, JavaScript, Node.js, IDEs, and the methodology used 
in the experiments. 


2.1 What is Low Code? 


The term ‘Low Code’ was introduced by the market research
company Forrester in June 2014, which described these platforms
as extraordinarily disruptive [2].

Low Code applications are developed using model-driven engi- 
neering principles and taking advantage of cloud infrastructures, 
automatic code generation, declarative, high level, and graphical 
abstractions to develop entirely functioning applications, meaning 
that these applications are mostly made through drag and drop of 
objects [3]. 


2.2 Low Code Development Platform 


LCDPs emerged in the early 2000s helping development teams work 
faster. The Low Code Development Platform market started back 
in 2011, speeding up the development and maintenance processes. 

An LCDP is set on a cloud or locally [3], allowing the develop- 
ment of Low Code applications using minimal code writing. Its 
objective is to give, to different types of users, the possibility to 
create applications in an easy, simple, and fast way [4]. 

Each LCDP has its programming language, such as Java, 
JavaScript, Python, and others [2]. LCDPs allow the development 
of distinct types of applications such as web apps and mobile apps 
[5]. 

The development of a Low Code application has the following 
processes: API Setup, App Creation, LaunchPad, Security Setup, 
Mobile Client Build, and Deploy/Transport, as shown in Figure 1. 

API Setup is the process in which the developer creates APIs,
which can be built in the LCDP or imported. App Creation is the
process in which the developer creates the app/module. In the
LaunchPad process, one or more modules are transformed into a
suitable application. Security Setup defines who has access to the
application or to which parts of it. Mobile Client Build prepares
the application to become a mobile app, and the last process is
Deploy/Transport. It should be mentioned that not all of these
processes are required to implement an application; this depends
on the type of application that is developed [6].


2.3 Java Swing 


Java is an object-oriented language, and Swing is a GUI (Graphical
User Interface) widget toolkit for it. Development of Swing started
in 1996; it supports multithreading and allows Java programmers
to easily create a UI for an application.

Some of the Swing components are labels (text), text areas,
buttons, tables, frames (application page), combo boxes, scroll
panes (object to scroll a page), file choosers (to get a file for
reading or saving), menus, toolbars, and others [7].


2.4 HTML 


HTML is the standard markup language for documents displayed 
in a web browser, defining their meaning and structure. There are 
also other technologies used to describe a web page’s appearance 
(CSS) or behaviour (JavaScript). “Hypertext” refers to links that 
allow users to create, store, and view text, connecting web pages 
directly so that “travelling” from one to another is quicker. Links 
are a fundamental characteristic of the Web. 

HTML uses “markup” to distinguish text, images, and other
content for display in a Web browser. An HTML element is set off
from other text in a document by “tags”, which consist of the
element name surrounded by “<” and “>”. HTML markup includes
special “elements” such as <head>, <title>, <p>, <button>, and many
others. The name of an element inside a tag is not case-sensitive,
that is, it can be written in uppercase, lowercase, or a combination.
For example, the <title> tag can be written as <Title>, <tItle>, or in
any other style [8].
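
The case-insensitivity of element names can be observed with any HTML parser. The minimal sketch below uses `html.parser` from the Python standard library purely as an illustration (it is not part of the applications developed in this paper); the class name `TagCollector` and the helper `start_tags` are our own. This parser reports every tag name in lower case, regardless of how it was written in the markup.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name already converted to lower case.
        self.tags.append(tag)


def start_tags(markup):
    """Return the start-tag names found in an HTML fragment."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


print(start_tags("<P>text</P><bUtToN>ok</bUtToN><Title>Demo</Title>"))
# prints ['p', 'button', 'title']
```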


2.5 Cascading Style Sheet (CSS) 


CSS is a stylesheet language for describing the presentation of
elements in an HTML or XML (eXtensible Markup Language)
document, including XML dialects such as SVG (Scalable Vector
Graphics), MathML (Mathematical Markup Language) or XHTML
(eXtensible HyperText Markup Language). CSS describes colours,
layout, and fonts of Web pages, allowing the presentation to be
adapted to different types of devices. CSS is independent of HTML
and can be used with any XML-based markup language.

CSS is one of the core languages of the Web and is standardized
across Web browsers according to W3C specifications [9].
Formerly, the various parts of the CSS specification were developed
synchronously, which allowed the latest recommendations to be
versioned as a whole, giving rise to CSS1, CSS2.1, and CSS3.
However, CSS4 has never been released.

Since CSS3, the scope of the specification has increased
considerably, with CSS modules differing significantly. Therefore,
it became more efficient to develop and release recommendations
separately per module. Currently, instead of versioning the CSS
specification, W3C periodically takes a snapshot of the latest
stable state of the CSS specification [10], [11].

2.6 JavaScript


JavaScript is a dynamic and lightweight scripting language that is
widely used in websites and web application services. It has
become one of the most widely used languages for Web
development; however, many non-browser environments also use
it, such as Node.js, Apache CouchDB, and Adobe Acrobat.
JavaScript is a prototype-based, multi-paradigm, single-threaded,
dynamic language, which is used for web page interface design,
creating cookies, mobile apps, games, and so on.

JavaScript should not be confused with the Java programming
language. The two programming languages have very different
syntax, semantics, and uses [12]. The main difference between
JavaScript and Java is that JavaScript code is written completely
in text and only needs to be interpreted, while Java, on the other
hand, must be compiled.


2.7 Node.js 


Node.js is an increasingly popular open-source, cross-platform,
event-driven back-end JavaScript runtime environment, widely
used in server-side and desktop applications. Node.js executes
JavaScript code outside a web browser and provides an effective
asynchronous programming model. In Node.js, time-consuming IO
operations, e.g., file access operations, can be delegated as
asynchronous tasks that run in dedicated threads. Thus, Node.js
applications are not blocked by these time-consuming IO
operations.

This asynchronous, event-driven model allows developers to use
JavaScript to write command-line tools and to produce dynamic
web page content before the page is sent to the user’s browser.
Node.js represents a “JavaScript everywhere” paradigm, unifying
web application development around a single programming
language, rather than different languages for server-side and
client-side scripts. These design choices aim to optimize
throughput and scalability in web applications with many
input/output operations [13].
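
The idea of delegating blocking IO to dedicated threads so that the main event loop is not blocked can be sketched in a few lines. The example below is an analogy written with Python's `asyncio` and is not Node.js code; the helper `slow_read` is a hypothetical stand-in for a time-consuming file access, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def slow_read(label, delay):
    """Hypothetical stand-in for a blocking file read."""
    time.sleep(delay)
    return f"{label} done"


async def main():
    started = time.perf_counter()
    # Both blocking calls are handed to worker threads and awaited together,
    # so the event loop stays free while they run.
    results = await asyncio.gather(
        asyncio.to_thread(slow_read, "a", 0.2),
        asyncio.to_thread(slow_read, "b", 0.2),
    )
    elapsed = time.perf_counter() - started
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f} s")
```

Because the two simulated reads overlap, the total elapsed time is typically close to the duration of one read rather than the sum of both.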


2.8 Integrated Development Environment 
(IDE) 


IDEs provide a convenient standalone solution that supports
developers during various phases of software development and are
designed to include all programming tasks in one application. One
of the main benefits of an IDE is that it offers a central interface
with the tools that a developer needs, including the following [14]:


• Code editor: Designed for writing and editing source code,
these editors are distinguished from text editors because their
function is to simplify and enhance the process of writing
and editing code for developers.

• Compiler: Compilers transform source code that is written
in a human-readable language into a machine-readable
language.

• Debugger: Debuggers are used during tests and can help
developers debug their application code.

• Build automation tools: These tools help developers to
automate common developer tasks to save time.

Additionally, some IDEs may also contain [14]:


• Class browser: Used to study and reference properties of an
object-oriented class hierarchy.

• Object browser: Used to inspect objects instantiated in a
running application program.

• Class hierarchy diagram: This allows developers to visualize
the structure of object-oriented programming code.


An IDE can be a stand-alone application, although it can also be
included as part of one or more compatible applications [14].


2.9 Methodology 


To compare Low Code development, Java Swing, and JavaScript 
programming, two applications are developed in all technologies. 
These applications are developed and tested on a computer with an 
Intel i5-8250U CPU, Windows 10, 512 GB SSD, and 8 GB of RAM 
(Random Access Memory). 

The first application is a calculator with the basic math
operations: addition, subtraction, multiplication, and division. The
second is a simple text editor, like a simple notepad. In the
implementations of these applications, the following metrics are
assessed: development time, in minutes; execution runtime, in
milliseconds (ms), which is the time that the application takes to
set up the UI; the number of written code lines; and the operations
execution time, in milliseconds. The execution runtime and
operations execution time reported are the averages of five
executions of the application. To ensure that the results are
reliable for JavaScript and Low Code, since an LCDP and its
applications run in a browser, the Brave browser (introduced in
subsection 2.9.3) is used in incognito mode, and the browser cache
is cleared before each run. Likewise, for Java Swing, the
CachemanXP program is executed before each run to clean the
RAM.
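
The averaging over five executions can be sketched as follows. This is only an illustration in Python, not the measurement code used for the applications in this paper; the function name `average_runtime_ms` and the timed workload are hypothetical.

```python
import time
from statistics import mean


def average_runtime_ms(operation, runs=5):
    """Run `operation` several times and return (average, samples) in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples), samples


# Hypothetical workload standing in for one calculator or editor operation.
avg_ms, samples = average_runtime_ms(lambda: sum(range(100_000)))
print(f"average over {len(samples)} runs: {avg_ms:.3f} ms")
```

A monotonic high-resolution clock such as `time.perf_counter` is preferable to wall-clock time for intervals in the millisecond range.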

The calculator has a text area to show the input and results.
Alongside the basic operation buttons mentioned above, it has
buttons for the numbers from 0 to 9, equal, delete, the decimal
point “.”, and clear. The calculator does not report errors. When
two operations are entered, it must perform the first operation and
then apply the second to its result (for 5+2*10, the first operation
is computed, 5+2=7, and then the second operation, 7*10=70).
This ignores operator precedence and may lead to mathematically
incorrect results, but that is not important for the evaluation.
When the text area ends with an operation character and another
is entered, the first one must be replaced (if the text area contains
“2+” and a “-” is entered, the result is “2-”). To get a valid
operation, a number must be entered, followed by an operation and
another number; the user then presses the equal button to get the
result or adds another operation.
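
The calculator rules above can be summarized in a short sketch. It is written in Python for illustration only and is not the Java Swing, JavaScript or Low Code implementation evaluated in this paper. The names `press`, `evaluate` and `run` are hypothetical, the delete and clear buttons are omitted, and the handling of division by zero and of an operator on an empty text area are our own assumptions.

```python
import re

OPERATORS = "+-*/"
# number, operator, number (an optional leading minus allows negative results)
EQUATION = re.compile(r"^(-?\d+(?:\.\d+)?)([+\-*/])(\d+(?:\.\d+)?)$")


def evaluate(text):
    """Return the result of a valid 'a op b' equation, else the text unchanged."""
    match = EQUATION.match(text)
    if match is None:
        return text
    left, op, right = float(match.group(1)), match.group(2), float(match.group(3))
    if op == "+":
        value = left + right
    elif op == "-":
        value = left - right
    elif op == "*":
        value = left * right
    else:
        if right == 0:
            return text  # assumption: division by zero leaves the text as is
        value = left / right
    return str(int(value)) if value.is_integer() else str(value)


def press(text, key):
    """Return the new content of the text area after one button press."""
    if key in OPERATORS:
        if not text:
            return text  # assumption: an operator on an empty area is ignored
        if text[-1] in OPERATORS:
            return text[:-1] + key  # replace the previous operator
        return evaluate(text) + key  # compute a pending equation first
    if key == "=":
        return evaluate(text)
    return text + key  # digits and the decimal point are appended


def run(keys):
    """Apply a sequence of button presses to an empty text area."""
    text = ""
    for key in keys:
        text = press(text, key)
    return text


print(run("5+2*10="))  # 70, evaluated left to right as (5+2)*10
print(run("2+-"))      # 2-
print(run("1+1+"))     # 2+
```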

The text editor is a simple text area with two buttons, one to 
load and the other to save the text. The text editor must give the 
user the possibility to choose where to save or load the file. The 
load and save functionalities must only allow .txt files. 

The Low Code applications are implemented using Neptune 
Planet 9 LCDP, version 2.3.1, which will be introduced in subsection 
2.9.1. In the Java Swing applications, NetBeans IDE is used, version 
12.5, which is described in subsection 2.9.2. 

The JavaScript applications are implemented with Visual Studio
Code, version 1.63.2, which is presented in subsection 2.9.4. It
should be noted that in pure JavaScript it is not possible to let the
user choose where to save a file, because it is always saved in the
Downloads folder [15]. These platforms are chosen based on our
knowledge of these types of platforms. NetBeans is chosen
because it is one of the most user-friendly IDEs; however, it does
not support JavaScript. For this reason, Visual Studio Code, which
is one of the best IDEs in terms of usage and popularity [16], is
chosen for JavaScript.

The description of tests execution is presented in Tables 1 and 2 
for the calculator and text editor, respectively. The tests are based 
on the following operations: 


• “Add number” operation includes adding a number and the
decimal point “.” to the text area.

• “Add operation” adds a mathematical operator like “+” to
the text area.

• “Add operation (S)” is the same as “Add operation”. However,
when there is already an operation character, it is substituted.
For example, when there is “1+” in the text area and a “-” is
introduced, the result is “1-”.

• “Add operation (R)” is the same as “Add operation”. However,
when there is already a valid equation, its result is calculated.
For example, “1+1+” becomes “2+”.

• “Delete” operation deletes a character from the text area.

• “Delete (N)” operation is when the text area is empty and
the delete button is pressed.

• “Clear” operation clears the whole text area.

• “Result” operation calculates the result of the equation in
the text area.

• “Result (N)” is the same as the “Result” operation, but applied
when the equation is not valid. For example, when there is
“1+” in the text area, the equation is not altered and the value
“1+” in the text area stays unchanged.

• “Load” operation loads the text from a .txt file to the text
area.

• “Save” operation saves the text in the text area into an
existing .txt file. This cannot be evaluated in the JavaScript
text editor because, when trying to save to an existing file,
the browser creates a new file by adding “(1)” to the new file
name.

• “Save (N)” operation saves the text in the text area into a
new .txt file.


Some observations: 1) The calculator tests start with the calculator 
clear of values. 2) In the calculator tests the time to select a button 
is not considered, only the time of the operation. 3) The text editor 
tests do not take into consideration the time to write a text, and the 
same text is used for all tests. 4) In the text editor tests the time to 
select a file is not considered, only the time to save/load a file. 


2.9.1 Neptune Planet 9. Neptune Planet 9 is an LCDP that uses
HTML and CSS as core technologies and Node.js as the
programming language. Its architecture is shown in Figures 2 and
3.

The development of applications is done in the App Designer
component, where data and resources from the Store, ODATA,
Media Library, API Designer, and Server Scripts are used. Table
Definition is used to define data types, which is not always
necessary, as these types can be automatically imported from an
external database. The LaunchPad serves as a portal for modules:
each developed module is added to it through Tiles, which allow
navigation to each module. Tiles are organized into groups through
Tiles Groups. Users are configured in the Users component and
can be created in the LCDP or obtained externally. In the
LaunchPad, each user has access to the modules depending on
their role. Finally, the LaunchPad can run on the Web or in an
APK (Android Package) that can be generated in the Mobile Client
component for mobile devices [6].

Modules can be created from a workflow using the Workflow 
Designer component and the Theme Designer component can be 
used to create a predefined theme for the entire LaunchPad to 
visually enhance it [6]. 


2.9.2 NetBeans IDE. NetBeans is a free IDE in which a programmer
can develop applications in languages such as Java, C, C++, PHP,
and others. This IDE supports many platforms, such as Windows,
Linux, Solaris, and macOS, and many types of API services.
NetBeans, like many other IDEs, allows a programmer to develop
many kinds of applications, from a plain text editor to a complex
web app [17], [18].


2.9.3 Brave Browser. Brave is a free and open-source web browser
developed by Brave Software, Inc. It is built on the open-source
Chromium browser project, which promotes faster and safer
browsing; because Chromium is open source, Brave can reuse its
code and add features on top. Brave’s popularity is increasing,
driven by its privacy-by-default functionality, which automatically
blocks online advertisements and website trackers. Its features
include ad-blocking, anti-tracking functionality, and
cryptocurrency offerings [19].


2.9.4 Visual Studio Code. Visual Studio Code is a cross-platform
editor developed by Microsoft for Windows, Linux, and macOS.
In 2016, Visual Studio Code left the public preview stage and was
released to the Web. It then quickly became one of the most
popular editors.

Visual Studio Code is a very powerful code-focused development 
environment expressly designed to make it easier to write web, mo- 
bile, and cloud applications using languages that are available to 
different development platforms and to support the application de- 
velopment lifecycle with a built-in debugger and integrated support 
for the popular Git version control engine [20]. 


3 EXPERIMENTAL RESULTS 


This section presents the experimental results of the developed 
applications. 

Figures 4, 5 and 6 present the user interface of the calculator
using Java Swing, Low Code, and JavaScript, respectively.

Figures 7, 8 and 9 present the user interface of the text editor
using Java Swing, Low Code, and JavaScript, respectively.

Figure 10 presents an example of the text editor's fileChooser
window for the load operation.

Table 3 presents the results of the development of each applica- 
tion concerning development time, execution runtime, and hand- 
written code lines. In Tables 3, 4, and 5, Java Swing is referred to as 
JSW, Low Code as LC, and JavaScript as JSC.


Table 1: Calculator test execution

Operation           Description
Add number          1. Click on a number (random)
                    2. Read time of operation (1)
Add operation       1. Click on an operation (random)
                    2. Read time of operation (1)
Add operation (S)   1. Click on an operation (random)
                    2. Click on another operation (random)
                    3. Read time of operation (2)
Add operation (R)   1. Click on a number (random)
                    2. Click on an operation (random)
                    3. Click on a number (random)
                    4. Click on an operation (random)
                    5. Read time of operation (4)
Delete              1. Click on a number/s or operation/s (random)
                    2. Click on delete
                    3. Read time of operation (2)
Delete (N)          1. Click on delete
                    2. Read time of operation (1)
Clear               1. Click on a number/s or operation/s (random)
                    2. Click on clear
                    3. Read time of operation (2)
Result              1. Click on a number (random)
                    2. Click on an operation (random)
                    3. Click on a number (random)
                    4. Click on equal
                    5. Read time of operation (4)
Result (N)          1. Click on a number/s or operation/s (random)
                    2. Click on equal
                    3. Read time of operation (2)


Table 2: Text editor test execution

Operation   Description
Load        1. Click on Load
            2. Select a file
            3. Read time of operation (2)
Save        1. Write the text
            2. Click on save
            3. Select an existing file
            4. Read time of operation (3)
Save (N)    1. Write the text
            2. Click on save
            3. Choose file name and location
            4. Read time of operation (3)


Table 4 presents the average runtime of each operation of the
calculator, in milliseconds.

Table 5 presents the runtime of each operation of the text editor,
in milliseconds.


4 DISCUSSION OF THE EXPERIMENTAL 
RESULTS 


This section discusses the results presented in the previous
section. The discussion must take into account that Low Code
and manual programming with HTML, CSS, and JavaScript are
similar, except that Low Code generates the HTML and CSS
automatically.


4.1 Discussion of the Results: Comparing Low 
Code with Java Swing and JavaScript 


As shown in Table 3, Java Swing and JavaScript applications have
a better performance than Low Code applications, but at a higher
cost in terms of development time and hand-written code.
A deeper analysis shows:

• In terms of development time, developing in Low Code is, on
  average, 1.74 times faster than programming in Java Swing
  and 1.10 times faster than programming in JavaScript:
  ◦ In the Low Code calculator, development is 1.73 times
    faster than Java Swing.
  ◦ In the Low Code text editor, development is 1.75 times
    faster than Java Swing.
  ◦ In the Low Code calculator, development is 1.06 times
    faster than JavaScript.
  ◦ In the Low Code text editor, development is 1.25 times
    faster than JavaScript.

• In terms of execution runtime, Java Swing applications are, on
  average, 2.72 times faster and JavaScript applications 36.61
  times faster than Low Code applications:
  ◦ In Java Swing, the calculator runtime is 6.92 times faster
    than Low Code.
  ◦ In Java Swing, the text editor runtime is 1.89 times faster
    than Low Code.
  ◦ In JavaScript, the calculator runtime is 28.95 times faster
    than Low Code.
  ◦ In JavaScript, the text editor runtime is 45.29 times faster
    than Low Code.

• In terms of hand-written code lines, Low Code applications
  have, on average, 2.18 times less code than Java Swing
  and 1.75 times less code than JavaScript:
  ◦ The Low Code calculator has 1.13 times fewer code lines
    than Java Swing.
  ◦ The Low Code text editor has 3.25 times fewer code lines
    than Java Swing.
  ◦ The Low Code calculator has 1.66 times fewer code lines
    than JavaScript.
  ◦ The Low Code text editor has 2.25 times fewer code lines
    than JavaScript.


IDEAS’22, August 22-24, 2022, Budapest, Hungary

Figure 2: Neptune Planet 9 architecture, resources and tools. Source: [6].

Figure 3: Neptune Planet 9 architecture, run, manage & secure and administrate. Source: [6].

Figure 4: Java Swing calculator based on NetBeans.

Figure 5: Low Code calculator based on Neptune P9 and Brave browser.

Figure 6: JavaScript calculator based on Visual Studio Code and Brave browser.

Figure 7: Java Swing text editor based on NetBeans.

Figure 8: Low Code text editor using Neptune P9 and Brave.

Figure 9: JavaScript text editor using Visual Studio Code and Brave.

Figure 10: Load operation fileChooser using NetBeans.


Table 3: Results of the experiments

Application / Property   Development time (minutes)   Execution runtime (ms)   Hand-written code lines
Calculator (JSW)         57                           156.34                   72
Calculator (LC)          33                           1082.60                  64
Calculator (JSC)         35                           37.40                    106
Text Editor (JSW)        14                           790.63                   39
Text Editor (LC)         8                            1494.60                  12
Text Editor (JSC)        10                           33.00                    27


Table 4: Results of the calculator operations in ms using JSW, LC, and JSC

Operation           Calculator (JSW)   Calculator (LC)   Calculator (JSC)
Add number          3.14               0.56              0.14
Add operation       3.25               0.52              0.16
Add operation (S)   1.05               0.74              0.06
Add operation (R)   1.78               0.80              0.48
Delete              0.42               0.26              0.10
Delete (N)          0.10               0.44              0.12
Clear               0.10               0.38              0.22
Result              1.26               0.22              0.60
Result (N)          0.21               0.44              0.16


Table 5: Runtime results for the text editor (in ms) using JSW, LC, and JSC

Operation   Text editor (JSW)   Text editor (LC)   Text editor (JSC)
Load        37.33               0.30               0.34
Save        5.66                0.92               Not possible
Save (N)    7.13                0.86               1.02
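
The per-application ratios quoted in this subsection follow directly from Table 3. As a minimal sketch (the dictionary layout and the helper name `times_larger` are illustrative choices, not part of the original study), they can be recomputed as follows:

```python
# Values copied from Table 3: (development time in minutes,
# execution runtime in ms, hand-written code lines).
TABLE_3 = {
    ("Calculator", "JSW"): (57, 156.34, 72),
    ("Calculator", "LC"): (33, 1082.60, 64),
    ("Calculator", "JSC"): (35, 37.40, 106),
    ("Text Editor", "JSW"): (14, 790.63, 39),
    ("Text Editor", "LC"): (8, 1494.60, 12),
    ("Text Editor", "JSC"): (10, 33.00, 27),
}
DEV_TIME, RUNTIME, CODE_LINES = 0, 1, 2


def times_larger(app, metric, numerator, denominator):
    """Ratio of one technology's metric to another's for the same application."""
    return TABLE_3[(app, numerator)][metric] / TABLE_3[(app, denominator)][metric]


for app in ("Calculator", "Text Editor"):
    # Development-time advantage of Low Code over the other two technologies.
    print(app, "dev time JSW/LC:", f"{times_larger(app, DEV_TIME, 'JSW', 'LC'):.2f}")
    print(app, "dev time JSC/LC:", f"{times_larger(app, DEV_TIME, 'JSC', 'LC'):.2f}")
    # Runtime penalty of Low Code relative to the other two technologies.
    print(app, "runtime LC/JSW:", f"{times_larger(app, RUNTIME, 'LC', 'JSW'):.2f}")
    print(app, "runtime LC/JSC:", f"{times_larger(app, RUNTIME, 'LC', 'JSC'):.2f}")
```

Each ratio here is computed per application; the averaged figures in the text combine the calculator and the text editor.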


As presented in Table 4, the Low Code calculator has a better
performance in terms of operation execution runtime than the
Java Swing calculator, and a performance similar to that of the
JavaScript calculator. Considering all the operations:


• On average, the calculator operations are 2.14 times faster in
  JavaScript than in Low Code and 2.59 times faster in Low
  Code than in Java Swing.

• The Java Swing calculator is faster in three operations:
  “Delete (N)”, “Clear”, and “Result (N)”. This is because Java
  is usually faster and, in these cases, there is only one line of
  code to execute (in “Delete (N)” there is an if, in “Clear” there
  is a set value, and in “Result (N)” there is an if). The
  differences compared to Low Code are 0.34, 0.28, and 0.23 ms,
  respectively, which are negligible considering that Java Swing
  is much slower in the other operations.


As shown in Table 5, the Low Code text editor has the best
performance in terms of operation execution runtime. For example:


• The “Load” operation is 124.43 times faster in Low Code than
  in Java Swing, and 1.13 times faster than in JavaScript.

• The “Save” operation is 6.15 times faster in Low Code than in
  Java Swing.

• The “Save (N)” operation is 8.29 times faster in Low Code than
  in Java Swing (7.13 ms versus 0.86 ms in Table 5) and 1.19 times
  faster than in JavaScript.


Considering the average execution runtime of the operations, the
Low Code text editor is 24.10 times faster than the Java Swing
text editor and 1.17 times faster than the JavaScript text editor.
Low Code and JavaScript are faster than Java Swing in this case
because Java Swing needs to determine which operating system it
is running on in order to make a system call, whereas JavaScript
does not, because it is managed by the browser [21].

The comparison between Java Swing and JavaScript alone depends
on the application’s complexity and nature, and is presented in
the next subsection.


4.2 Discussion of the Results: Java Swing with
JavaScript


As presented in Table 3, JavaScript applications have better
performance and take less time to develop than Java Swing
applications, but on average need more hand-written code lines.
A deeper analysis shows:


• In terms of development time, developing in JavaScript is, on
  average, 1.58 times faster than programming in Java Swing.
  ◦ In the JavaScript calculator, development is 1.63 times
    faster.
  ◦ In the JavaScript text editor, development is 1.40 times
    faster.
• In terms of execution runtime, JavaScript applications are,
  on average, 13.45 times faster than Java Swing applications.
  ◦ In JavaScript, the calculator runtime is 4.18 times faster.
  ◦ In JavaScript, the text editor runtime is 23.96 times faster
    (790.63 ms versus 33.00 ms in Table 3).
• In terms of hand-written code lines, Java Swing applications
  have, on average, 1.20 times less code than JavaScript.
  ◦ In the Java Swing calculator, the number of code lines is
    1.47 times less.
  ◦ In the text editor, however, the Java Swing version has
    1.44 times more code lines (39 versus 27 in Table 3).


Table 6: Operations average for all experiments

Operation                    Java Swing   Low Code   JavaScript
Code lines                   55.50        38.00      66.50
Execution runtime (ms)       473.49       1288.60    35.20
Development time (minutes)   34.00        20.50      22.50
Operations runtime (ms)      5.12         0.54       0.310

As shown in Table 4, the JavaScript calculator has a better
performance in terms of operation execution runtime than the
Java Swing calculator. Considering all the calculator operations:


• On average, the calculator operations in JavaScript are 5.54
  times faster than in Java Swing.

• The Java Swing calculator is faster in only two operations,
  “Delete (N)” and “Clear”, and the differences are 0.02 and 0.12
  milliseconds, respectively, which is negligible considering
  that Java Swing is much slower in the other operations.


As presented in Table 5, the JavaScript text editor has a better
performance in terms of operation execution runtime than the
Java Swing text editor. A deeper analysis shows:


• The “Load” operation is 109.79 times faster in JavaScript than
  in Java Swing.
• The “Save” operation is not comparable because it cannot be
  tested in the JavaScript text editor.
• The “Save (N)” operation is 6.99 times faster in JavaScript
  than in Java Swing.

The JavaScript text editor is 32.69 times faster than the Java
Swing text editor considering the average execution runtime of all
operations.


4.3 Summary of all Results 


Table 6 presents the averages, and Table 7 the standard deviations,
of all the previous results for the number of code lines, execution
runtime (ms), development time (minutes), and operations runtime
(ms). These results are obtained from Tables 3, 4, and 5.


Table 7: The standard deviation for all experiments

Operation                    Java Swing   Low Code   JavaScript
Code lines                   16.50        26.00      39.50
Execution runtime (ms)       317.15       206.00     2.2
Development time (minutes)   20.00        12.50      12.50
Operations runtime (ms)      9.95         0.23       0.28

With the results of Tables 6 and 7, the differences between these
technologies can be seen more clearly:


• Low Code is 9.48 times faster than Java Swing and 1.74 times
  slower than JavaScript, in terms of operation runtime.

• JavaScript is 16.52 times faster than Java Swing, in terms of
  operation runtime.

• JavaScript has the lowest standard deviation compared to
  the other technologies, except in the number of code lines,
  where Java Swing has the lowest standard deviation.


Note that the values comparing these technologies may vary
depending on the application’s complexity and nature.
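
As a rough illustration of how the summary rows relate to Table 3, the sketch below recomputes the mean and standard deviation of the code lines and execution runtime for each technology. The use of the population standard deviation (`statistics.pstdev`) is an assumption about how Table 7 was computed, and the dictionary and function names are illustrative.

```python
from statistics import mean, pstdev

# Per-application values from Table 3, ordered (calculator, text editor).
CODE_LINES = {
    "Java Swing": [72, 39],
    "Low Code": [64, 12],
    "JavaScript": [106, 27],
}
RUNTIME_MS = {
    "Java Swing": [156.34, 790.63],
    "Low Code": [1082.60, 1494.60],
    "JavaScript": [37.40, 33.00],
}


def summarize(samples):
    """Map each technology to (mean, population standard deviation)."""
    return {tech: (mean(values), pstdev(values)) for tech, values in samples.items()}


for name, samples in (("Code lines", CODE_LINES), ("Execution runtime (ms)", RUNTIME_MS)):
    for tech, (avg, std) in summarize(samples).items():
        print(f"{name:<24}{tech:<12}mean={avg:8.2f}  std={std:7.2f}")
```

With only two applications per technology, the standard deviation is simply half the gap between the two values, which is why it is large whenever the calculator and the text editor differ strongly.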


5 OPEN PROBLEMS 


This section presents the problems addressed in this paper. The
comparison of Low Code, Java Swing, and JavaScript technologies
can be approached in many ways. In this paper, the technologies
were compared by looking at some metrics of their applications,
such as execution runtime. This practical approach gives
developers a perspective on how these technologies can help them
in their job.

As described in [1], there is no comparison of Low Code with
other technologies, which leaves space for this type of research.
Our research fills this gap, but there is still significant
work to be done. We identify the following open research problems:


• Comparison with other Low Code technologies.

• Using Low Code with more complex applications, including
  database development.

• Comparing the frontend UI, backend logic, and data store
  developed using Low Code technologies.

• Using machine learning in Low Code vs machine learning
  in other technologies.

• Research on the usage of Low Code Development Platforms for
  communication, human behaviour, and decision-making.


6 CONCLUSIONS AND FUTURE WORK 


Low Code development is a technology that speeds up the process
of deploying an application version to the production environment.
Low Code facilitates programming by reducing the amount of
hand-written code and allowing non-programmers to build
applications. These technologies may also simplify the work of
database developers. Low Code development, Java Swing, and
JavaScript programming were experimentally evaluated, and it can
be concluded that Low Code applications are valuable when what is
important is


IDEAS2022 


André Calcada and Jorge Bernardino 


the development time and the amount of hand-written code. However,
their execution runtime for the setup of the application is slower
than that of Java Swing and JavaScript. Low Code and JavaScript
execute most of the operations faster than Java Swing, which leads
to the conclusion that Low Code and JavaScript applications have a
better performance. Their performance in terms of operation
execution time is very similar, which occurs because Node.js is
based on JavaScript and the applications of these technologies run
in the same environment, a browser.

Despite the advantages of Low Code, it must be taken into
consideration that Low Code has a steep learning curve. It should
also be noted that the compared technologies run in different
environments: Java Swing runs on the operating system, and the
other technologies run in a browser, which directly impacts
performance.

It can be concluded that JavaScript is the best programming 
method in terms of execution runtime, although it may be the one 
with a larger number of code lines, depending on the applications. 

As future work, we intend to carry out a study with other
technologies, such as .NET, and with more complex applications,
such as a web app that manages users, so that a more thorough
comparison can be made.


ABSTRACT


The existence of big data, social media interactions, and digital
globalization has changed the way people make decisions, either in
their own lives or on matters of collective importance.
Computational Social Choice (COMSOC), as an emerging field, brings
together social fields (social choice theory) and technical fields
(computer science, mathematics, economics, and logic). In the last
few decades, expert rating was used to select the winner of a
contest or competition, and was later merged with crowd voting.
However, the results of voting based on the aggregation of crowd
opinion were not considered satisfactory. Majority judgement is a
new method of election. It is the consequence of a new theory of
social choice in which voters judge candidates instead of ranking
them. In this research, we used the data of the final round of the
Eurovision Song Contest 2021. The Eurovision Song Contest is held
annually, with almost 40 participating countries. We applied
majority judgement to the Eurovision Song Contest data and found
that Italy obtained the highest position, followed by Croatia and
Australia in the second and third positions, respectively, in the
2021 competition.


CCS CONCEPTS 


• Information systems → Data analytics; • Computing methodologies
→ Distributed artificial intelligence; • Computer systems
organization → Robotic autonomy; • General and reference →
Surveys and overviews; • Networks → Wireless local area
networks; • Hardware → Sensors and actuators; Wireless devices.


1 INTRODUCTION


There has been a paradigm shift towards computational social
choice since big data changed the way data is collected and
processed. Making correct decisions using the information acquired
from big data is a challenge [17], [42]. Computational social
choice, also known as COMSOC, has emerged as an interdisciplinary
field during the last few decades, and it acts as a bridge between
social science and technical science as well as between classic
and modern problems [3]. It draws knowledge from social choice
theory, computer science, mathematics, economics, and logic [31].
Social choice theory focuses on how individual votes are
aggregated in such a way that society makes collective decisions
[28], [33]. COMSOC introduces an algorithmic concept into social
choice theory, where the main focus is on the computational and
algorithmic perspectives of voting challenges [10]. The voting
process is the central area of COMSOC. According to [17], the
best-known voting rules are:


• Majority: The candidate with a majority of the votes is the
  winner

• Plurality: The candidate with more votes than any other
  candidate is the winner

• Borda count method: The candidate with the highest score is
  the winner

• K-approval: The candidate with the most approvals is the winner

• Copeland’s method: The candidate with the highest number
  of pairwise victories is the winner

• Veto: The candidate with the lowest negative score is the
  winner

Electing and ranking candidates has been a challenge for many
decades [7], [15]. All around the world, nations, societies,
organizations, and communities elect their representatives, and
juries or committees rank employees, singers, drivers, artists,
students, universities, hospitals, award nominees, movies, beauty,
etc. [23], [27]. With the development of intelligent decision
support systems (DSS), the degree of uncertainty during election
and ranking can be reduced [13], [16], [41]. In past decades, a
small group of people, known as experts, used to select the best
candidate from the pool, a mechanism called "expert rating".
Recently, instead of relying on a few experts, one can easily
obtain the opinion of community users publicly for the selection
of the winner. This approach is called crowd sourcing [12].
Traditionally, experts are professionals who have specialized
knowledge of and exposure to their particular field, whereas the
crowd is a number of people with different backgrounds and
exposure [25]. Today, with the help of social media and other
platforms, it is easy for the crowd to share its opinion over the
internet [14]. In some cases and fields, the crowd’s opinion is
better than that of the experts [43].

Another theory of ranking methods is "Condorcet’s Paradox",
developed by the French philosopher Nicolas de Condorcet in the
18th century. According to the Condorcet jury theorem, if each
member of a jury has an equal and independent chance, better than
random but worse than perfect, of making a correct judgment on
whether a defendant is guilty (or on some other factual
proposition), the majority of jurors is more likely to be correct
than each individual juror, and the probability of a correct
majority judgment approaches 1 as the jury size increases [8]. It
is observed that the majority preferences can be irrational even
though the individual preferences are rational [9], [24]. Kenneth
Arrow developed another theorem on ranking in the 20th century
[19]. He proved that there exists no method of aggregating the
preferences of two or more individuals over three or more
alternatives into collective preferences, where this method
satisfies five seemingly plausible axioms: (1) universal domain;
(2) ordering; (3) weak Pareto principle; (4) independence of
irrelevant alternatives and (5) non-dictatorship.
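
The jury theorem mentioned above can be illustrated numerically. The sketch below assumes an odd number of independent jurors who are each correct with the same probability p, and sums the binomial probabilities of a correct majority. The function name and the chosen value p = 0.6 are illustrative.

```python
from math import comb


def majority_correct_probability(n, p):
    """Probability that more than half of n independent jurors (n odd) are correct."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Each juror is right 60% of the time; the majority becomes more reliable as n grows.
for jury_size in (1, 3, 11, 101):
    print(jury_size, round(majority_correct_probability(jury_size, 0.6), 4))
```

For any p strictly between 0.5 and 1 the printed values increase with the jury size, which is the behaviour the theorem describes.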

Balinski and Laraki introduced a new voting system called Majority
Judgement (MJ) [21]. Majority judgement is a new method of
election in which the jury or voters only judge the candidates
instead of ranking them. It is a consequence of a new theory of
social choice [6]. In majority judgement, the judges or voters do
not vote as such; they evaluate the candidates using a common
scale of grades. Majority judgement offers a solution to the
search for an optimal method of competition or election in which
the merits of the competitors are to be evaluated [4]. Majority
judgement has been practiced in competitions around the world as
well as in political scenarios. The grades given by the voters are
the input, and the majority grade of each candidate is the output.
This majority grade is used to rank the candidates or to select a
winner [44]. The majority grade ensures that if the majority of
the voters gives grade B to a candidate, the candidate’s majority
grade will be B. However, if every member of the majority gives
grade C to the candidate, that candidate’s majority grade will
certainly not be C [35].

The Eurovision Song Contest (ESC) is an annual event that was
started in 1955 and was held for the first time in 1956 in Lugano,
Switzerland, with seven participating countries. Later, in 1961,
the number of participating countries increased to 16. Nowadays,
non-European countries such as Israel, Morocco, and Turkey can
also participate in the contest and have become regular
participants. Several hundred million people watch the contest
[26]. In the Eurovision Song Contest, the participating countries
vote to select the winner. However, there are suspicions and
accusations of ‘tactical’ and ‘political’ voting in the song
contest [39]. Eurovision is the most widely viewed event in all
the candidate countries. It is also covered by the press as the
most famous international contest. The winning country celebrates
the victory nationally. Often the leader of the winning country
congratulates the whole team and the whole nation on the victory
and highlights its importance for the country and national morale
[1], [22]. In the Eurovision Song Contest, the voting system has
been modified several times. Initially, national juries selected
the winners. Later, in 1997, televoting was introduced. Then, from
2009, a hybrid system combining the popular vote and a jury vote
was used to select the winner. In 2016, the system was revised
again so that the jury votes and televotes were presented
separately [40]. Nevertheless, we believe that applying majority
judgement in the Eurovision Song Contest can result in fairer
voting.


2 RELATED WORK 


Many researchers have worked on the complexity of query
evaluation. In [2], the researchers computed possible and certain
answers using partial order theory and tried to reduce the
complexity in their particular topic. In [31], the researchers
carried out a systematic investigation of query evaluation on
election databases. They analyzed how the interaction between the
partial preferences, the voting rules, and the relational context
affects the complexity of query evaluation. Moreover, researchers
studied the computational complexity of the evaluation problem for
approval voting and positional scoring rules regarding PPIC, the
Mallows noise model, and EDM [7]. The researchers in [11] worked
on the complexity of the possible winners (PW) problem on partial
chains, considering partial order theory. The article [10]
investigated the practical aspects of computing the necessary and
possible winners in elections over incomplete voter preferences.
Dixit et al. used SAT solving in their research and investigated
aggregation queries over inconsistent databases [16].

The challenges of computational social choice for voting in
multi-agent systems are described in [17], while the authors of
[18] explained the advantages and disadvantages of crowd voting
and expert voting. Imber et al. studied the complexity of some
computational problems for the classic approval-based committee
voting rules [27]. In [28], the authors explored the evaluation of
the similarity of voting procedures. In [29], the authors studied
the complexity of estimating the probability of an outcome in an
election over probabilistic votes. Kovacevic et al. used machine
learning for the fusion of crowd and expert opinion in a crowd
voting environment [32], while [43] used machine learning along
with crowd voting for stock model selection. Various authors have
also worked on preference aggregation [36], [34]. A comparison of
the studies related to COMSOC is given in Table 1.


Table 1: State-of-the-art in the field of COMSOC

REF | Methodology | Application | Theory Type | Solution
[2] | Conceptual | Computing possible and certain answers | Partial order theory | Complexity
[7] | Conceptual | Predicting election outcomes and estimating their robustness | Mallows noise model | Complexity
[11] | Conceptual | Possible winners on partial chains | Partial order theory | Complexity
[10] | Experimental | Necessary and possible winners | Partial order theory | Intractability
[16] | Conceptual | Answering aggregation queries | N/A | N/A
[17] | Comparative | Voting in multi-agent systems | N/A | N/A
[18] | Conceptual | Votes from crowd and knowledge from experts | Collective decision | Decision
[27] | Conceptual | Committee voting | Collective decision | Complexity
[28] | Conceptual | Multi-agent systems and voting | Condorcet theory | Polynomial time
[29] | Conceptual | Outcomes in probabilistic elections | N/A | Probability
[30] | Conceptual | COMSOC meets databases | Partial order | Complexity
[31] | Conceptual | Query evaluation in election databases | N/A | Complexity
[32] | Experimental | Fusion of crowd and experts in crowd voting | Collective decision | N/A
[33] | Conceptual | Decision making under incomplete knowledge | Collective decision | Efficiency Stability
[38] | Conceptual | Online elicitation of necessarily optimal matchings | Partial order | Complexity
[19] | Conceptual | Multi-robot task allocation problem | Arrow’s theorem | N/A
[12] | Conceptual | Crowd voting on participation in crowdsourcing contests | Expectancy theory and tournament theory | Contest’s reliance
[15] | Conceptual | Frugal bribery in voting | N/A | Polynomial time solvable
[14] | Experimental | Online review sites affect collective decision making | Collective decision | N/A
[13] | Conceptual | Information and analytical collective decision-making | Collective decision | Efficiency
[34] | | Computational social choice and preference reasoning | Preference reasoning | N/A
[36] | Experimental | Making group decisions | Collective decision | Intractability
[20] | Conceptual | Support decision making on university program selection | Collective decision | Leveraging decision making
[43] | Experimental | Stock selection model | Collective decision | Stock recom-


Figure 1: Proposed Framework (stages: Eurovision Dataset, Combining votes, Majority Judgement, Ranks of Candidates)


3 METHODOLOGY


The overall methodology used in this research is given in Figure 1.
We first obtain the jury votes and expert votes and combine them
manually. Great care was taken while adding the votes manually.
Then, majority judgement was applied to the newly created dataset
of the Eurovision Song Contest. Majority judgement provided the
ranks of all the participating countries.


3.1 Dataset


In this research, we used the Eurovision Song Contest 2021
dataset. The 65th edition of the Eurovision Song Contest was held
in 2021, with 39 participating nations. Only 26 nations were
selected for the final round. The final score was the aggregate of
jury votes and televotes from all 39 participants. Each row of the
data shows the votes given by one member to the finalists. The
decision method was that the top 12 songs were selected and ranked
in such a way that the best song received 12 points and the 12th
choice received 1 point, while the other participants received 0
points. Self-voting was not allowed in the competition.


3.2 Majority Judgement


In this research, we ranked the candidates of the Eurovision Song
Contest using the majority judgement method. The majority
judgement method, a new method of voting, was proposed by Balinski
and Laraki in 2007 [5]. Their work demonstrates that majority
judgement is the only method that meets a whole set of criteria
developed over several centuries in the field of “social choice
theory”. It avoids the famous paradox of Jean Antoine Caritat,
Marquis de Condorcet, and its generalization in Kenneth Arrow’s
"impossibility" theorem, by considering the problem of voting
differently. Instead of imagining that a voter has an ordered list
of candidates in their head (which this experiment belies), it is
assumed that they can rate each candidate directly.

The purpose of this voting procedure is to select one out of
n candidates (n > 2). In song contests, each judge or voter awards
a certain grade to each participant, measured on an ordinal scale.
These ordinal grades are expressed in numbers, such as
1, 2, 3, 4, ..., 10, or in words, such as excellent, good, fair,
acceptable, bad, poor. The next step is to determine the median
grade of each candidate, i.e. the median of all the grades that
the candidate received. The candidate with the highest median
grade is selected as the winner. In the case of a tie between the
median grades, α, [5] defined the following tie-breaking rules:


(1) Delete α from the grades of the tied candidates.
(2) Compute the new value of the median grade, α.
(3) If the new value of α of any previously tied candidate is
    higher than the others, that candidate is selected as the
    winner.
(4) If there is still a tie between the new values of α, the
    process should be repeated from step one.


The majority value of each candidate is assigned in the form of
an ordered triple [6]:

\[
(p, \alpha^{*}, q), \quad \text{where} \quad
\begin{cases}
\alpha = \text{median grade of the candidate (majority grade)} \\
p = \text{number of grades above the majority grade} \\
q = \text{number of grades below the majority grade}
\end{cases}
\]

The * can be 0, positive or negative depending upon the values
of p and q:

\[
\alpha^{*} =
\begin{cases}
\alpha^{+}, & \text{if } p > q \\
\alpha^{\circ}, & \text{if } p = q \\
\alpha^{-}, & \text{if } p < q
\end{cases}
\]

Suppose there are two candidates, A and B, with majority values
$(p_A, \alpha^{*}_A, q_A)$ and $(p_B, \alpha^{*}_B, q_B)$,
respectively. A ranks higher than B, i.e.
$(p_A, \alpha^{*}_A, q_A)$ ranks higher than
$(p_B, \alpha^{*}_B, q_B)$, if
• A’s majority grade is higher than B’s ($\alpha^{*}_A > \alpha^{*}_B$)
• Both of them have an $\alpha^{+}$ majority grade and $p_A > p_B$
• Both of them have an $\alpha^{-}$ majority grade and $q_A < q_B$


The winner is the candidate who has the maximal majority-value
triple in this ordering. If there are several such candidates,
the winner should probably be selected by lottery from among
them.
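
The ordering on majority-value triples can be sketched in a few lines of Python. This is an illustrative reading of the three rules above, not the authors' implementation: it assumes numeric grades, uses the lower median (`statistics.median_low`) as the majority grade, and treats α⁺ as ranking above α° and α° above α⁻ for the same α. The function names and the example grades are hypothetical.

```python
from statistics import median_low


def majority_gauge(grades):
    """Return (p, alpha, sign, q) for one candidate's list of numeric grades."""
    alpha = median_low(grades)          # majority grade (lower median: an assumption)
    p = sum(g > alpha for g in grades)  # number of grades above alpha
    q = sum(g < alpha for g in grades)  # number of grades below alpha
    sign = "+" if p > q else "-" if p < q else "0"
    return p, alpha, sign, q


def gauge_key(grades):
    """Sort key that is larger for candidates ranked higher by the three rules."""
    p, alpha, sign, q = majority_gauge(grades)
    sign_rank = {"-": 0, "0": 1, "+": 2}[sign]
    # Among alpha+ candidates a larger p wins; among alpha- candidates a smaller q wins.
    tie_break = p if sign == "+" else -q if sign == "-" else 0
    return alpha, sign_rank, tie_break


# Hypothetical grades on a 1-5 scale, five voters per candidate.
grades = {
    "X": [5, 4, 4, 2, 1],
    "Y": [5, 5, 4, 3, 3],
    "Z": [3, 3, 2, 2, 1],
}
ranking = sorted(grades, key=lambda name: gauge_key(grades[name]), reverse=True)
print(ranking)
```

In this example X and Y share the majority grade 4, and Y ranks first because its grades above and below the median are balanced while X has more grades below than above.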


4 EXPERIMENTS, RESULTS AND DISCUSSION 


We used the Ranky library for Python to implement majority
judgement [37]. We consider the list of candidate countries
C = (c1, c2, ..., c39) and the list of voter countries V = (v1, v2, ..., v26). A score
matrix M of size n x m is obtained by letting each voter country
score the performance of each finalist country (the rows of M
correspond to C and the columns to V). All candidate countries
were scored using the same procedure, i.e. majority judgement. The
challenge is to obtain a single rank r for each country from the
score matrix via r = rank(f(M)).
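
The relation r = rank(f(M)) can be illustrated without the Ranky library. The sketch below only shows the shape of the computation under stated assumptions: `rank_from_scores` is a hypothetical helper, f is assumed to aggregate each row of M by its lower median, and tied candidates share a rank.

```python
from statistics import median_low


def rank_from_scores(M, f=median_low):
    """M holds one row of grades per candidate; returns ranks with 1 = best."""
    aggregated = [f(row) for row in M]                # f(M): one value per candidate
    distinct = sorted(set(aggregated), reverse=True)  # best aggregated value first
    position = {value: k + 1 for k, value in enumerate(distinct)}
    return [position[a] for a in aggregated]


# Hypothetical 3 x 3 score matrix: rows are candidates, columns are voters.
M = [
    [8, 9, 7],
    [5, 6, 7],
    [8, 8, 10],
]
ranks = rank_from_scores(M)
```

In this toy matrix the first and third candidate share the same median and therefore the same rank; a tie-breaking rule such as the majority-value triple would be needed to separate them.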

The results of the experiments are given in Table 2, where Italy
ranks first, Croatia second and Australia third.


Country   | Max. Rank | Position
Italy     | 4.5       | 1st
Croatia   | 3.5       | 2nd
Australia | 3         | 3rd


Table 2: Results of the Eurovision song contest using majority
judgement


Figure 2 shows the rank of each participating country. The
winning country stands at the highest point of the figure.
On our dataset, Italy is the country with the highest rank,
followed by Croatia and Australia in second and third position,
respectively. Figure 3 visualizes the preference matrix (2D).

Figure 2: Ranking of participants of the Eurovision song contest
using majority judgement


When discussing the results of voting, two important factors
should also be considered.

1. Cultural and linguistic similarities and differences: A
common musical taste shared by two or more nations results in a
strong preference for voting for a particular country. Two
countries may have similar traditions or culture and be very
familiar with each other's customs, which increases the chance
that they favour each other. Moreover, understanding the language
of a song matters a lot in a song contest: people tend to like songs
whose lyrics they fully understand. Therefore, countries with the
same or similar languages can have a higher tendency to vote for
each other.

2. Voting bias: Voting bias is a related phenomenon, because
geographical factors strongly affect the behaviour of voter
countries in the Eurovision song contest. It was observed that many
countries like or dislike the songs of their neighbouring countries.
Such geographical effects can lead to political voting in the
contest [39].


4.1 Remarks on the MJ procedure in our case study


There are several advantages of using majority judgement in our
case study:


• All voters are considered equally.

• All candidates are considered equally.

• If a candidate A is awarded the highest grades by all the voters,
then A is the winner.

• All candidates are ranked in a transitive order.

• A winner A remains the winner if another candidate y is removed.

• If the grades of the winner A are increased, A remains the
winner.

• The winner A remains the winner if a new candidate is added
whose grade distribution is identical to that of A or of another
candidate.

• It tends to reduce manipulation in voting.

• Conceptually, the motivations of the electorate and their
satisfaction are modelled with the aid of their "utilities". The
utility function of a voter is complex and completely unknown.
It is plausible to imagine that a voter would like a candidate's
final grade to be as close as possible to the grade he or she
believes the candidate deserves, but this is not always so.
In this "plausible" case the utility function is absolute;
otherwise it is relative, i.e. what matters are the candidates'
final rankings, not their final grades. In any case, the
majority judgement mechanism makes no assumptions about the
voters' utilities. It depends only on what can be known in
practice. It is "strategy-proof" for a large class of reasonable
utility functions and, when nothing is known about them, it best
combats strategic manipulation.


Figure 3: Visualization of preference matrix (2D)




IDEAS’22, August 22-24, 2022, Budapest, Hungary 


• Some critics have averred that a voter needs to be forced
to "make up his or her mind" by expressing a clear-cut
preference between any two candidates.

• The language of grades is ordinal, which is why the method used
needs to be ordinal as well. Majority judgement is ordinal.
Methods that are based on sums or averages of points are
not ordinal.


5 CONCLUSION 


The contributions made in this article can be summarized as follows:


• We highlighted the importance of COMSOC and discussed
various voting methods, especially the majority judgement method.

• We proposed the use of majority judgement, a new method
of voting, for ranking the candidates of the Eurovision song
contest 2021.

• We discussed the results obtained by using the majority
judgement method for the selection of the winner in the song
contest.


The voting techniques discussed and used in this article may find
applications in other research problems where ranking or selection
is required. This kind of research aims at bringing together
computational social choice and ranking/selection problems
such as winner selection in art competitions. In future, we intend
to investigate biased behaviour in voting using novel
techniques of artificial intelligence and deep learning.
ABSTRACT


Besides making good predictions, a classifier should explain how
the input data is related to the classification result. There is
general agreement that logic expressions provide a better explanation
than other methods such as SVM, logistic regression, and neural
networks. However, a classifier based on Boolean logic needs to map
continuous data to Boolean values, which can cause a loss of
information. In contrast, we design a quantum-logic-inspired classifier
where continuous data are directly processed and the laws of the
Boolean algebra are maintained. As a result of our approach we
obtain a CQQL condition which provides good insights into the
relation of input features to the class decision. Furthermore, our
experiment shows a good prediction accuracy.


CCS CONCEPTS 


• Computing methodologies → Machine learning algorithms;
Vagueness and fuzzy logic.


1 INTRODUCTION AND RELATED WORK


The general classification problem is given by the following
components. Let $D$ be a set of objects. Every object $o$ is
characterized by its values for $n$ given attributes: $o = (o_1, \ldots, o_n)$. Let
$Cl = \{class_1, \ldots, class_k\}$ be a set of $k$ classes and $m$ be a mapping
from $D$ to $Cl$ ($m : D \to Cl$). The mapping is typically not explicitly
known for all objects and is the learning target. Let $O \subset D$ be a
subset of $D$ where for every object the class membership is known.
That is, we hold $M = \{(o, m(o)) \mid o \in O\}$. Let $TR \subset M$ be a set of
training objects and $TE = M \setminus TR$ be the test set.

The classification problem is to construct a mapping function
$cl : D \to Cl$, called a classifier, from given training objects $TR$. The
classifier should provide a good prediction on $TE$. The accuracy of
a classifier can be quantified as the fraction of correctly classified
objects of all test objects:

\[ accuracy = \frac{|\{(o, m(o)) \in TE \mid m(o) = cl(o)\}|}{|TE|} \]

In the following, we reformulate the initial $k$-class classification
problem to $k$ one-class classification problems for $i = 1, \ldots, k$:

\[ m_i : D \to \{0, 1\}, \qquad m_i(o) = \begin{cases} 1 & \text{if } m(o) = class_i \\ 0 & \text{otherwise.} \end{cases} \]

Thus, we need to determine the classifiers $cl_i : D \to \{0, 1\}$ with
high accuracy. Next, we will focus on $i = 1$ and will write for short
$m$ and $cl$ instead of $m_1$ and $cl_1$.
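
As a small illustration of the two definitions above, the following sketch derives the binary target of one class from multi-class labels and computes accuracy as the fraction of matching predictions. The helper names `one_vs_rest` and `accuracy` and the toy labels are hypothetical.

```python
def one_vs_rest(labels, target):
    """Binary target m_i: 1 where the label equals the chosen class, else 0."""
    return [1 if y == target else 0 for y in labels]


def accuracy(predicted, actual):
    """Fraction of objects whose predicted class equals the known class."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


labels = ["a", "b", "a", "c"]      # known classes m(o) of four test objects
m1 = one_vs_rest(labels, "a")      # binary target for the class "a"
acc = accuracy([1, 0, 0, 0], m1)   # a classifier that misses the third object
```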

Besides accuracy, a good classifier cl should provide a good 
human understanding [3] of the relation between an object o and 
its corresponding class m(o). The focus of the following work is on 
human understanding of the relation by means of logic. 

There are well-known standard classifier methods like SVM,
Bayesian, k-NN, neural network, and decision tree, see [1]. While
the listed classifier methods provide fairly good accuracy
for many applications, only the decision tree is seen as a classifier
providing a good understanding in many cases. Decision tree nodes
and edges refer to Boolean conditions, and a path corresponds to the
conjunction of Boolean conditions. Thus, the decision tree is based
on Boolean logic decisions [4]. There is general agreement that
logic-based classifier methods provide a better understanding of the
classification process than methods that are not founded
in logic [10]. In the following text we focus on logic-based classifier
methods.

Non-classifier approaches for achieving a good understanding of 
the relation between input objects and binary decisions are Qualita- 
tive Comparative Analysis QCA and its refinement fsQCA [9]. Both 
are logic-based and try to find necessary and sufficient conditions for 
a binary decision. fsQCA stands for fuzzy set QCA. The logic-based 
methods decision tree, QCA, and fsQCA use atomic, Boolean condi- 
tions on input data that are mainly realized by Boolean comparisons 
with thresholds. Their resulting Boolean values are combined, evalu- 
ated and analyzed. As a general problem, this kind of input mapping 
of typically real numbers to binary Boolean values can cause a loss 
of input information. Thus, interactions between input reals can 
only be expressed on Boolean level but not directly on value level. 
We will call such classifier methods input thresholds methods. 

As an example, we visualize in Fig. 1 input thresholds for a
non-linearly separable problem: a conjunction

\[ th_{0.5}(x_1) \wedge th_{0.5}(x_2) \]

where

\[ th_\tau(x) = \begin{cases} 1 & \text{if } x > \tau \\ 0 & \text{otherwise.} \end{cases} \]

Figure 1: Boolean logic: input thresholds in the unit cube $[0, 1]^3$

Figure 2: Output threshold on $x_1 \cdot x_2$ (CQQL conjunction): no threshold (left), values below threshold (middle), and applying a threshold on output (right) $th_{0.5}(x_1 \cdot x_2)$


An input object $(x_1, x_2)$ is represented by a point in a horizontal
plane whereas the vertical axis refers to the classification output
0 or 1. Interaction between $x_1$ and $x_2$ is expressed by the Boolean
conjunction. We see that input thresholds refer to axis-parallel,
vertical cuts of the plane forming blocks which are active (output
1) or inactive (output 0). Modifications of the input thresholds just
move block sides along an axis.

In contrast, other classification methods apply a Boolean map- 
ping as a last step for obtaining a class decision. Those methods are 
typically not based on logic and do not lose information by Boolean 
input mapping but may suffer from missing human understanding. 
We will call them output threshold methods. 

An output threshold on top of a fuzzy conjunction
$th_{0.5}(x_1 \wedge x_2) := th_{0.5}(x_1 \cdot x_2)$ is depicted in Fig. 2. It expresses that high
values for $x_1$ and $x_2$ correspond to class 1. In contrast to input
thresholds there is no axis-parallel block structure. The output
threshold refers to a horizontal cut at a certain height on the vertical
axis.

By comparing input thresholds methods with output threshold
methods we see the effects of modifying threshold values. Because
the cut planes are perpendicular (vertical vs. horizontal), output
threshold methods do not subsume input thresholds methods, nor
vice versa. They express different semantics. That is, no method is,
in general, better than the other one with respect to prediction. The
method must fit the class boundary of the given classification
problem.
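
The difference between the two kinds of methods can be made concrete with a short sketch. The function names and the two sample points are hypothetical; the thresholds are free parameters, and the second point uses an output threshold of 0.3 (an assumption chosen for illustration) to show a case that the input thresholds reject.

```python
def th(x, tau):
    """Threshold function: 1 if x exceeds tau, else 0."""
    return 1 if x > tau else 0


def input_threshold(x1, x2, tau=0.5):
    """Boolean conjunction of two thresholded inputs (axis-parallel block)."""
    return th(x1, tau) & th(x2, tau)


def output_threshold(x1, x2, tau=0.5):
    """Threshold applied to the product conjunction x1 * x2 (horizontal cut)."""
    return th(x1 * x2, tau)


a = (0.6, 0.6)   # both inputs above 0.5, but the product 0.36 is below 0.5
b = (0.4, 0.9)   # first input below 0.5, but the product 0.36 is above 0.3

a_in, a_out = input_threshold(*a), output_threshold(*a)
b_in, b_out = input_threshold(*b), output_threshold(*b, tau=0.3)
```

Point a is accepted only by the input thresholds and point b only by the output threshold, which illustrates that neither kind of method subsumes the other once the threshold values may differ.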

For output threshold methods we demonstrated a fuzzy logic
[2] approach. However, fuzzy logic including t-norm and t-conorm
from Zadeh [16] and Łukasiewicz [5] suffers from violating
important rules of logic. For example, Zadeh's max-function
violates the law of the excluded middle ($x \vee \neg x = 1$) for $x = 0.5$:
$\max(0.5, 1 - 0.5) \neq 1$. The product as t-norm, however, violates
idempotence ($x \wedge x = x \neq x^2 = x \cdot x$) for $x \in\, ]0, 1[$. Thus, fuzzy logic
based on those t-norms violates Boolean laws.
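
Both violations can be checked numerically. The function names below are hypothetical; they only restate the max t-conorm, the standard negation and the product t-norm mentioned in the text.

```python
def zadeh_or(x, y):
    """Zadeh's t-conorm: maximum."""
    return max(x, y)


def neg(x):
    """Standard fuzzy negation."""
    return 1 - x


def product_and(x, y):
    """Product t-norm."""
    return x * y


x = 0.5
excluded_middle = zadeh_or(x, neg(x))   # would be 1 under Boolean laws
idempotence = product_and(x, x)         # would be x under Boolean laws
```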

In order to overcome the deficiencies of fuzzy logic, we adopt
quantum logic and develop a quantum-logic-inspired classifier as
an output threshold method. First concepts were published in [6].
The elements of a quantum logic [8] are projectors, which identify
subspaces of a Hilbert space. Each projector represents a condition.
Based on the subset relation and an operation for negation, the
projectors form an orthomodular lattice. Two projectors $p_1$ and $p_2$
are called commuting if and only if $p_1 \cdot p_2 = p_2 \cdot p_1$ holds.
Pairwise commuting projectors provide a Boolean sublattice
[8]. That result from quantum logic is the theoretical foundation of
our proposed classifier based on QL conditions. The evaluation of
QL conditions deals with input values from the unit interval $[0, 1]$.
That is, we assume for the input the existence of $n$ atomic, unary
conditions¹ on values $o_i$ of $o = (o_1, \ldots, o_n)$ returning a truth value
from the unit interval $[0, 1]$.

For our quantum-logic-based classifier we identify the following
characteristics:


• Logic interpretation provides a good understanding and
gives us a powerful theory for processing logic expressions.

• Avoiding input thresholds supports interactions on the level
of continuous values.

• In contrast to fuzzy logic the evaluation of our classification
conditions obeys Boolean laws.


¹ They correspond to mutually commuting atoms (projectors) of a distributive and
hence Boolean QL lattice.


Thus, we get both interactions on continuous data and Boolean
laws from quantum logic. For example, by applying Boolean laws
we obtain in Section 7 the CQQL condition '$G \vee (A \wedge D \wedge I)$'
from the PIMA dataset about risk of diabetes: There is a high risk
of diabetes in case of a high plasma glucose concentration (G) or in
case of a high age (A) together with a strong diabetes pedigree (D)
and a high level of insulin (I).

In the following sections we will develop our quantum-logic-inspired
classifier. It is based on the concepts of CQQL. After introducing 
the main concepts of CQQL, we apply them in order to derive from 
training data weights on minterms (conjunctions of negated or 
non-negated atomic conditions) of a logic expression in disjunc- 
tive normal form. The weights represent the logical expression 
and relate the input data to the classification result. For a binary 
classification result we compute an appropriate threshold value 
on the evaluation result of the logic expression against an input. 
Last, an experiment demonstrates how our approach can be used to 
predict a classification result and how to understand the underlying 
mapping based on logic. 


2 COMMUTING QUANTUM QUERY
LANGUAGE (CQQL)


The quantum-logic-inspired language CQQL (commuting quantum
query language) was introduced in [12, 14]. A CQQL condition
corresponds to a projector $p$, that is, $p$ is a self-adjoint, idempotent,
and linear operator of a Hilbert space [15] and an element of an
orthomodular lattice (quantum logic) with meet for the conjunction,
join for the disjunction and an orthocomplement for the negation.
In that formalism, an object $o$ corresponds to a ket vector $|o\rangle$ of
length one, constructed by use of the tensor product
$|o\rangle := |o_1\rangle \otimes \ldots \otimes |o_n\rangle$. Evaluating a projector $p$ with respect to an
object $|o\rangle$ corresponds to a quantum measurement expressed by
$\langle o|p|o\rangle \in [0, 1]$. Mutually commuting projectors lead to a
distributive sublattice and, hence, to a Boolean sublattice [8].

Syntactically, a CQQL condition is a Boolean expression (conjunc- 
tion, disjunction, negation). We assume n atomic, unary conditions 
on the n values of an object o. Such a condition expresses gradually 
whether an input value is a high value. It can be easily shown by 
tensor product construction, see [11], that the n atomic conditions 
are mutually commuting. Thus, all conditions constructed on those 
atomic conditions form a Boolean (orthomodular, distributive) lat- 
tice. 

Above we defined the evaluation of a CQQL condition $e$ against
an object $o$ by calculating $\langle o|p_e|o\rangle$ where $|o\rangle$ and $p_e$ are the
corresponding ket vector and the corresponding projector, respectively.
Actually, from [11, 12, 14] we know that for evaluating conjunction,
disjunction, and negation of a CQQL condition we do not
need to construct $|o\rangle$ and $p_e$. Instead, we obtain the same
evaluation result by applying simple arithmetic operations if the CQQL
condition is in a specific normal form (CQQL normal form). Let
$atoms(e)$ be the set of atomic conditions involved in a possibly
nested condition $e$. The CQQL normal form requires that for each
conjunction $e_1 \wedge e_2$ and for each disjunction $e_1 \vee e_2$ (but not for the
special case of an exclusive disjunction) the atom sets are disjoint:
$atoms(e_1) \cap atoms(e_2) = \emptyset$. If for $e_1 \vee e_2$ the conjunction $e_1 \wedge e_2$ is
unsatisfiable in propositional logic then the disjunction is exclusive.
We mark each exclusive disjunction by $\dot{\vee}$. The test for unsatisfiability
can be performed syntactically by applying Boolean laws. If a CQQL
condition is not in CQQL normal form then it can be syntactically
transformed into that normal form by applying Boolean laws [12].

A CQQL condition $e \in CQQL$ in the required normal form is
evaluated against an object $o$ by recursively defining

\[ eval : CQQL \times D \to [0, 1]: \]

• Atomic condition: If $e$ is an atomic condition then
$eval(e, o) \in [0, 1]$ returns the result from applying the
corresponding function on $o$.
• Negation: $eval(\neg e, o) = 1 - eval(e, o)$.
• Conjunction: $eval(e_1 \wedge e_2, o) = eval(e_1, o) \cdot eval(e_2, o)$.
• Disjunction (non-exclusive):
$eval(e_1 \vee e_2, o) = eval(e_1, o) + eval(e_2, o) - eval(e_1, o) \cdot eval(e_2, o)$.
• Exclusive disjunction:
$eval(e_1 \mathbin{\dot{\vee}} e_2, o) = eval(e_1, o) + eval(e_2, o)$.


For brevity, if the object $o$ is uniquely given from the context then
we will simply write $eval(e)$ instead of $eval(e, o)$.
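
The recursive definition can be sketched as a small evaluator over nested tuples. The tuple encoding, the operator tags and the function name `evaluate` are hypothetical; the sketch assumes that the condition is already in CQQL normal form and does not check this.

```python
def evaluate(e, o):
    """Evaluate a condition in CQQL normal form against object o.

    e is a nested tuple: ("atom", name), ("not", e1), ("and", e1, e2),
    ("or", e1, e2), or ("xdisj", e1, e2) for an exclusive disjunction.
    o maps atom names to truth values from [0, 1].
    """
    op = e[0]
    if op == "atom":
        return o[e[1]]
    if op == "not":
        return 1 - evaluate(e[1], o)
    a, b = evaluate(e[1], o), evaluate(e[2], o)
    if op == "and":
        return a * b
    if op == "or":
        return a + b - a * b
    if op == "xdisj":
        return a + b
    raise ValueError(f"unknown operator: {op}")


x, y = ("atom", "x"), ("atom", "y")
o = {"x": 0.5, "y": 0.8}

# x or (not x) is an exclusive disjunction, so the law of the
# excluded middle holds even for the truth value 0.5.
excluded_middle = evaluate(("xdisj", x, ("not", x)), o)
conjunction = evaluate(("and", x, y), o)
```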

We now extend the expressive power of a CQQL condition by
introducing weighted conjunction $(e_1 \wedge_{\theta_1, \theta_2} e_2)$ and weighted
disjunction $(e_1 \vee_{\theta_1, \theta_2} e_2)$. [13] develops the concept of weights in
CQQL from quantum mechanics and quantum logic. Weight
variables $\theta_1, \theta_2$ stand for values out of $[0, 1]$. A weight $eval(\theta_i) = 0$
means that the corresponding argument has no impact and a weight
$eval(\theta_i) = 1$ equals the unweighted case (full impact). We regard
every weight variable $\theta_i$ as a 0-ary atomic condition. Before we
evaluate a condition with weights we map all weighted conjunctions
and disjunctions into an unweighted condition:

\[ (e_1 \wedge_{\theta_1, \theta_2} e_2) \mapsto ((e_1 \vee \neg\theta_1) \wedge (e_2 \vee \neg\theta_2)) \]
\[ (e_1 \vee_{\theta_1, \theta_2} e_2) \mapsto ((e_1 \wedge \theta_1) \vee (e_2 \wedge \theta_2)) \]

See the modified example of a diabetes condition from the
introduction:

\[ e = G \vee_{1,1} (A \wedge D \wedge I). \]

Both arguments of the disjunction have the maximum weight of 1.
Mapping it to the unweighted disjunction yields

\[ e = (G \wedge 1) \vee ((A \wedge D \wedge I) \wedge 1). \]

The value 1 is the neutral element of a conjunction. Thus, we
simplify to

\[ G \vee (A \wedge D \wedge I). \]

The condition is in the CQQL normal form and by applying the
evaluation rules above we obtain the following arithmetic formula for
evaluation:

\[ eval(e, o) = G + A \cdot D \cdot I - G \cdot A \cdot D \cdot I. \]




Let an object $o$ have the values (0.3, 0.2, 0.9, 0.4) for the attributes
(G, A, D, I); then we get:

\[ eval(e, o) = 0.3 + (0.2 \cdot 0.9 \cdot 0.4) - (0.3 \cdot 0.2 \cdot 0.9 \cdot 0.4) = 0.3504. \]


So far, we have introduced binary conjunction and disjunction. 
Since the evaluation of a CQQL condition obeys Boolean laws 
(commutativity, associativity), we can write n-ary disjunction and 
n-ary conjunction as short form for a nested binary operation. 
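
The mapping of weighted operators and the worked diabetes example can be reproduced numerically. In this sketch the arguments are already evaluated truth values, the weights are treated as constants from [0, 1], and the names `or_`, `weighted_or` and `weighted_and` are hypothetical. The arithmetic for `or_` presumes disjoint atom sets, as required by the CQQL normal form.

```python
def or_(a, b):
    """Non-exclusive disjunction of two evaluated conditions."""
    return a + b - a * b


def weighted_or(e1, e2, t1, t2):
    """(e1 OR_{t1,t2} e2) is mapped to (e1 AND t1) OR (e2 AND t2)."""
    return or_(e1 * t1, e2 * t2)


def weighted_and(e1, e2, t1, t2):
    """(e1 AND_{t1,t2} e2) is mapped to (e1 OR NOT t1) AND (e2 OR NOT t2)."""
    return or_(e1, 1 - t1) * or_(e2, 1 - t2)


G, A, D, I = 0.3, 0.2, 0.9, 0.4
full = weighted_or(G, A * D * I, 1, 1)     # both weights 1: unweighted case
only_g = weighted_or(G, A * D * I, 1, 0)   # weight 0: second argument has no impact
```

With both weights set to 1 the result agrees with the value 0.3504 computed above; with the second weight set to 0 only the glucose condition remains.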


3 COMPLETE DISJUNCTIVE NORMAL FORM 


For a certain classification problem, we want to find a matching
CQQL condition $e$ together with a well chosen output threshold
value $\tau$:

\[ cl_\tau(o) = th_\tau(eval(e, o)) \quad \text{where} \quad th_\tau(x) = \begin{cases} 1 & \text{if } x > \tau \\ 0 & \text{otherwise.} \end{cases} \]

From the laws of the Boolean algebra we know that every condition
$e$ can be expressed in the complete disjunctive normal form, that is,
every condition is equivalent to a subset of $2^n$ minterms. We assume
for each of the $n$ object attributes exactly one atomic condition $c_j$.
The minterm subset relation for a condition can be expressed by
use of minterm weights $\theta_i \in \{0, 1\}$:

\[ e = \bigvee\nolimits_{\theta_1, \ldots, \theta_{2^n}} \left( minterm_1, \ldots, minterm_{2^n} \right) = \bigvee_{i=1}^{2^n} \left( minterm_i \wedge \theta_i \right) \tag{1} \]

\[ minterm_i = \bigwedge_{j=1}^{n} c_{ij} \tag{2} \]

\[ c_{ij} = \begin{cases} c_j & \text{if } (i - 1) \,\&\, 2^{j-1} > 0 \\ \neg c_j & \text{otherwise.} \end{cases} \tag{3} \]

The symbol '&' stands for bitwise and, $i - 1$ is considered as a binary
number and $j$ as a bit position.

Notice that the disjunction of two different complete minterms is
always exclusive. Thus, $e$ is in CQQL normal form and its evaluation
against object $o$ yields

\[ eval(e, o) = \sum_{i=1}^{2^n} \theta_i \prod_{j=1}^{n} c^o_{ij} \quad \text{where} \tag{4} \]

\[ c^o_{ij} = \begin{cases} eval(c_j, o) & \text{if } (i - 1) \,\&\, 2^{j-1} > 0 \\ 1 - eval(c_j, o) & \text{otherwise.} \end{cases} \tag{5} \]
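
Equations (3) to (5) translate directly into bit operations. In the following sketch the list `truth` is assumed to hold the already evaluated atomic conditions eval(c_j, o) for j = 1, ..., n; the function names are hypothetical.

```python
def minterm_value(i, truth):
    """Product over j of c_ij for minterm i (1-based), see Equations (3) and (5)."""
    value = 1.0
    for j, t in enumerate(truth, start=1):
        positive = (i - 1) & (1 << (j - 1))   # bit j-1 of i-1 set: c_j is not negated
        value *= t if positive else 1.0 - t
    return value


def eval_dnf(theta, truth):
    """Equation (4): sum of the values of all minterms with weight 1."""
    return sum(w * minterm_value(i, truth) for i, w in enumerate(theta, start=1))


truth = [0.3, 0.8]                                  # eval(c_1, o), eval(c_2, o)
values = [minterm_value(i, truth) for i in range(1, 5)]
total = eval_dnf([1, 1, 1, 1], truth)               # all four minterms active
conjunction = eval_dnf([0, 0, 0, 1], truth)         # only c_1 AND c_2 active
```

Because the complete minterms are mutually exclusive and exhaustive, their values add up to 1, and activating only the last minterm yields the product conjunction of both atoms.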


4 EXTRACTION OF MINTERMS 


Next, we will extract a CQQL condition $e$ in complete disjunctive
normal form from training data. We have to find the weight value
$\theta_i$ for every minterm $i$. The starting point is a set $TR$ of $(x, y)$ pairs
where $x$ refers to a training object from $O$ with $x = (x_1, \ldots, x_n)$
and $y = m(x) \in \{0, 1\}$.

As a preparatory step we map all $x_j$ from $M$ to $x'_j \in [0, 1]$ by using a
monotonic mapping function for realizing $eval(c_j, x)$. We interpret
$x'_j$ as a gradual truth value telling us how high the attribute value $x_j$
is. It is similar to the concept of a linguistic label in fuzzy logic. The
mapping function can be linear: $x'_j = (x_j - min_j)/(max_j - min_j)$.
As an alternative mapping function we may follow a probabilistic
approach and take $x'_j = P(X_j < x_j) = \int_{-\infty}^{x_j} f(x)\,dx$ where $f(x)$ is a
density function. In the following, we assume that every $x$ from
$TR \cup TE$ is an element of the hyper-cube $[0, 1]^n$.
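
Both mapping functions can be sketched for a single attribute column. The helper names and the sample values are hypothetical, and the second function uses the empirical distribution function as a stand-in for integrating a fitted density, which is an assumption rather than the mapping used in the experiment.

```python
from bisect import bisect_right


def minmax_map(column):
    """Linear mapping (x - min) / (max - min); assumes max > min."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def ecdf_map(column):
    """Empirical P(X <= v) for every value v of the column."""
    ordered = sorted(column)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in column]


glucose = [90, 120, 150, 180]     # hypothetical raw attribute values
linear = minmax_map(glucose)
probabilistic = ecdf_map(glucose)
```

Both mappings are monotonic, so a higher raw value never receives a lower truth value.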
A good classifier is one with high accuracy. Therefore, we
maximize the accuracy of condition (1) depending on the minterm
weights $\theta_i$ based on $TR$.

Before we start, we have to adapt the formula for accuracy to our
CQQL scenario. That is, we regard $eval(e, x) \in [0, 1]$ as evaluation
result of a condition $e$ where a value near to 1 corresponds to true
and a value near to 0 corresponds to false. For a given $(x, y)$-pair
and the evaluation result of a CQQL condition $e$ we can distinguish
four cases of a confusion matrix:


                   | $y$                 | $\neg y$
$eval(e, x)$       | $y \wedge e$        | $\neg y \wedge e$
$eval(\neg e, x)$  | $y \wedge \neg e$   | $\neg y \wedge \neg e$


The cases on the diagonal ($y \wedge e$ is the correct prediction in case
of acceptance, also called correct alarm, and $\neg y \wedge \neg e$ is the correct
prediction in case of rejection, also called correct rejection) refer
to the correct results. Accuracy $acc$ for a continuous evaluation
result $eval(e, x) = 1 - eval(\neg e, x)$ can now be computed over the
two correct cases $y \cdot eval(e, x) + (1 - y)(1 - eval(e, x))$. Summing
up over all training data yields:

\begin{align*}
acc &= \sum_{(x,y) \in TR} \bigl( y \cdot eval(e, x) + (1 - y)(1 - eval(e, x)) \bigr) \\
    &= \sum_{(x,y) \in TR} \bigl( eval(e, x) \cdot (2y - 1) + 1 - y \bigr) \\
    &= \sum_{(x,y) \in TR} eval(e, x) \cdot (2y - 1) + \sum_{(x,y) \in TR} (1 - y) \\
    &= \sum_{(x,y) \in TR} \Bigl( \sum_{i=1}^{2^n} \theta_i \prod_{j=1}^{n} c^x_{ij} \Bigr) \cdot (2y - 1) + \sum_{(x,y) \in TR} (1 - y) \\
    &= \sum_{i=1}^{2^n} \theta_i \sum_{(x,y) \in TR} (2y - 1) \cdot \prod_{j=1}^{n} c^x_{ij} + \sum_{(x,y) \in TR} (1 - y).
\end{align*}

We see that accuracy shows a linear dependence on the minterm
weights $\theta_i$ for fixed $TR$-pairs. The first derivative provides the
constant gradient on $\theta_i$:

\[ \frac{\partial acc}{\partial \theta_i} = \sum_{(x,y) \in TR} (2y - 1) \cdot \prod_{j=1}^{n} c^x_{ij} \]

Because of $y \in \{0, 1\}$ we reformulate the gradient to:

\[ \frac{\partial acc}{\partial \theta_i} = \sum_{(x,1) \in TR} \prod_{j=1}^{n} c^x_{ij} - \sum_{(x,0) \in TR} \prod_{j=1}^{n} c^x_{ij} \]

For maximizing accuracy a minterm weight $\theta_i$ should have the
value 1 if $\frac{\partial acc}{\partial \theta_i} > 0$ and 0 otherwise:

\[ \theta_i = \begin{cases} 1 & \text{if } \sum_{(x,1) \in TR} \prod_{j=1}^{n} c^x_{ij} > \sum_{(x,0) \in TR} \prod_{j=1}^{n} c^x_{ij} \\ 0 & \text{otherwise.} \end{cases} \tag{6} \]
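
The linear dependence of accuracy on the minterm weights can be verified on a toy training set. All names and data below are hypothetical, and the training objects are assumed to be already mapped to the unit square, so that eval(c_j, x) is simply the j-th component of x.

```python
def minterm_value(i, x):
    """Product over j of c_ij evaluated on x, for minterm i (1-based)."""
    value = 1.0
    for j, t in enumerate(x, start=1):
        value *= t if (i - 1) & (1 << (j - 1)) else 1.0 - t
    return value


def eval_dnf(theta, x):
    """Evaluation of the weighted complete disjunctive normal form."""
    return sum(w * minterm_value(i, x) for i, w in enumerate(theta, start=1))


def continuous_accuracy(theta, TR):
    """Sum over TR of y * eval(e, x) + (1 - y) * (1 - eval(e, x))."""
    return sum(y * eval_dnf(theta, x) + (1 - y) * (1 - eval_dnf(theta, x))
               for x, y in TR)


def gradient(i, TR):
    """Derivative of the accuracy with respect to theta_i (constant in theta)."""
    return sum((2 * y - 1) * minterm_value(i, x) for x, y in TR)


TR = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.2, 0.1), 0), ((0.1, 0.3), 0)]

base = continuous_accuracy([0, 0, 0, 0], TR)      # no minterm active
with_4 = continuous_accuracy([0, 0, 0, 1], TR)    # only minterm 4 active
theta = [1 if gradient(i, TR) > 0 else 0 for i in range(1, 5)]   # Equation (6)
```

Switching on minterm 4 raises the accuracy by exactly its gradient, and the decision rule activates the two minterms in which the first attribute is not negated.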


In other words, for the decision whether a minterm should be active
(having value 1) or not (having value 0) it is sufficient to compare
the impact of the positive training data $(x, 1) \in TR$ against the impact of
the negative training data $(x, 0) \in TR$ on minterm $i$. Be aware that
the decision depends on the relative number of positive training
objects. Therefore, let $\gamma_1 = \sum_{(x,1) \in TR} 1$ and $\gamma_0 = \sum_{(x,0) \in TR} 1$ be the
number of positive and negative training objects, respectively. The
fraction of negative objects is then given by $\gamma = \frac{\gamma_0}{\gamma_0 + \gamma_1}$. In the
uneven case ($\gamma \neq 1/2$) we compensate the effect on the minterm
weight decision by:

\[ \theta_i = \begin{cases} 1 & \text{if } \gamma \cdot E_i > (1 - \gamma) \cdot N_i \\ 0 & \text{otherwise} \end{cases} \tag{7} \]

where

\[ E_i = \sum_{(x,1) \in TR} \prod_{j=1}^{n} c^x_{ij}, \qquad N_i = \sum_{(x,0) \in TR} \prod_{j=1}^{n} c^x_{ij}. \]

We obtained that decision rule by maximizing accuracy where
correct alarm and correct rejection are given the same weight.
However, in some real life scenarios correct alarms should have
a different weight than correct rejections. Such a weighted
accuracy can be expressed by:

\[ \lambda \cdot y \cdot eval(e, x) + (1 - \lambda) \cdot (1 - y) \cdot (1 - eval(e, x)) \]

where $\lambda$ is the weight for correct alarms. The trade-off between
good recall and good precision leads to the decision rule

\[ \theta_i = \begin{cases} 1 & \text{if } \gamma \cdot \lambda \cdot E_i > (1 - \gamma) \cdot (1 - \lambda) \cdot N_i \\ 0 & \text{otherwise.} \end{cases} \tag{8} \]

5 STABLE MINTERM WEIGHTS 


Following the minterm decision rule (8) we decide whether a
minterm is active or inactive. For some minterms the decision is
very clear. However, the decision is not clear if the left term is very
close to the right term ($\gamma \cdot \lambda \cdot E_i \approx (1 - \gamma) \cdot (1 - \lambda) \cdot N_i$). We call
such minterms unstable because adding a single new
training object may change the decision. Unstable minterms have a
low impact on the result. We are interested in stable minterms. For
measuring stability we compute the ratio $\rho_i$ of the left side to the
sum of both sides for a minterm $i$:

\[ \rho_i = \frac{\gamma \lambda E_i}{\gamma \lambda E_i + (1 - \gamma)(1 - \lambda) N_i}, \qquad \rho_i \in [0, 1]. \tag{9} \]

A value for $\rho_i$ close to 1/2 means an unstable minterm $i$, a value
near to 1 means a stable active minterm $i$, and a value near to 0
means a stable inactive minterm $i$.

The question is now, should unstable minterms be active or
inactive? We propose to sort all minterms by their values for $\rho_i$ and
choose a $\rho$-threshold $\theta_\rho$ out of them that provides a sufficient
accuracy and a good compactness of condition $e$. The modified minterm
decision rule is:

\[ \theta_i = \begin{cases} 1 & \text{if } \frac{\gamma \lambda E_i}{\gamma \lambda E_i + (1 - \gamma)(1 - \lambda) N_i} > \theta_\rho \\ 0 & \text{otherwise} \end{cases} \tag{10} \]
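
The stability ratio and the modified decision rule can be sketched as follows. The function names and the four pairs of sums are hypothetical; the sums play the role of E_i and N_i for a toy problem with four minterms and a balanced training set.

```python
def stability(E, N, gamma, lam=0.5):
    """Equation (9): share of the left side in the sum of both sides."""
    left = gamma * lam * E
    right = (1 - gamma) * (1 - lam) * N
    return left / (left + right)          # assumes left + right > 0


def select_minterms(rho, theta_rho):
    """Equation (10): a minterm is active if its ratio exceeds theta_rho."""
    return [1 if r > theta_rho else 0 for r in rho]


# Hypothetical sums (E_i, N_i) for four minterms
sums = [(0.04, 1.35), (0.26, 0.25), (0.26, 0.35), (1.44, 0.05)]
rho = [stability(E, N, gamma=0.5) for E, N in sums]

loose = select_minterms(rho, 0.5)     # equivalent to decision rule (8)
strict = select_minterms(rho, 0.65)   # drops the unstable second minterm
```

The second minterm has a ratio of about 0.51 and is therefore unstable: it is active for the threshold 0.5 but inactive for 0.65, which yields a more compact condition.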


6 FINDING OUTPUT THRESHOLD 
After applying our decision rule (10) we obtain the formula:

\[ e = \bigvee_{i : \theta_i = 1} \bigwedge_{j=1}^{n} c_{ij}. \]

minterm_eval(i, o, n)
    value := 1
    for j in range(1, n+1)
        if (i − 1) & 2^(j−1) > 0
            value := value * eval(c_j, o)
        else
            value := value * (1 − eval(c_j, o))
    return value


Figure 3: minterm_eval


get_weight(i, TR, n, λ, θ_ρ)
    E := 0   N := 0   γ_1 := 0   γ_0 := 0
    for (x, y) in TR
        if y == 1
            γ_1 := γ_1 + 1
            E := E + minterm_eval(i, x, n)
        else
            γ_0 := γ_0 + 1
            N := N + minterm_eval(i, x, n)
    γ := γ_0 / (γ_0 + γ_1)
    ρ := γλE / (γλE + (1 − γ)(1 − λ)N)
    if ρ > θ_ρ
        return 1
    return 0


Figure 4: get_weight


object_eval(o, n, {θ_i})
    value := 0
    for i in range(1, 2^n + 1)
        if θ_i == 1
            value := value + minterm_eval(i, o, n)
    return value


Figure 5: object_eval


class(o, n, {θ_i}, τ)
    if object_eval(o, n, {θ_i}) > τ
        return 1
    return 0


Figure 6: class
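
The pseudocode of Figures 3 to 6 can be transcribed into runnable Python as follows. This is a sketch under stated assumptions: objects are tuples that are already mapped to the unit cube, so that eval(c_j, o) is the j-th component of o; `classify` replaces the name `class`, which is a reserved word in Python; the loop index j is 0-based and addresses attribute j + 1; the toy training set and the parameter values are hypothetical.

```python
def minterm_eval(i, o, n):
    """Figure 3: value of minterm i (1-based) for object o."""
    value = 1.0
    for j in range(n):                    # 0-based bit position
        if (i - 1) & (1 << j):
            value *= o[j]
        else:
            value *= 1.0 - o[j]
    return value


def get_weight(i, TR, n, lam, theta_rho):
    """Figure 4: weight of minterm i according to decision rule (10)."""
    E = N = 0.0
    gamma1 = gamma0 = 0
    for x, y in TR:
        if y == 1:
            gamma1 += 1
            E += minterm_eval(i, x, n)
        else:
            gamma0 += 1
            N += minterm_eval(i, x, n)
    gamma = gamma0 / (gamma0 + gamma1)
    left = gamma * lam * E
    right = (1 - gamma) * (1 - lam) * N
    rho = left / (left + right)           # assumes left + right > 0
    return 1 if rho > theta_rho else 0


def object_eval(o, n, theta):
    """Figure 5: sum of the values of all active minterms."""
    return sum(minterm_eval(i, o, n)
               for i in range(1, 2 ** n + 1) if theta[i - 1] == 1)


def classify(o, n, theta, tau):
    """Figure 6: class decision by an output threshold."""
    return 1 if object_eval(o, n, theta) > tau else 0


TR = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.2, 0.1), 0), ((0.1, 0.3), 0)]
n = 2
theta = [get_weight(i, TR, n, lam=0.5, theta_rho=0.65)
         for i in range(1, 2 ** n + 1)]
predictions = [classify(x, n, theta, tau=0.5) for x, _ in TR]
```

The loop over all 2^n minterms makes the exponential cost in the number of attributes explicit.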


The evaluation of $e$ against an object $o$ returns a continuous value from
the unit interval: $eval(e, o) \in [0, 1]$. As a last step, we have to find
the output threshold value $\tau$ for

\[ cl_\tau(o) = th_\tau(eval(e, o)). \tag{11} \]

Let $min_1 = \min_{(x,1) \in TR} eval(e, x)$ be the smallest evaluation result
of the positive training objects and $max_0 = \max_{(x,0) \in TR} eval(e, x)$
the highest result of the negative training objects. In case of $max_0 < min_1$,
positive objects and negative objects are well separated and
we set $\tau = (max_0 + min_1)/2$.
Otherwise, we have to choose a $\tau$ value from the interval

\[ \tau \in [min_1, max_0]. \]

In order to find a threshold which maximizes discrete accuracy we
use the training objects from $TR$:

\[ \tau = \arg\max_{\substack{(x, \_) \in TR, \\ \tau_x := eval(e, x), \\ \tau_x \in [min_1, max_0]}} accuracy(e, \tau_x, TR) \tag{12} \]

where

\[ accuracy(e, \tau_x, TR) = \frac{|\{(x, y) \in TR \mid y = cl_{\tau_x}(x)\}|}{|TR|}. \]

For evaluating the power of prediction the classifier needs to be
checked against the test set $TE$:

\[ accuracy(e, \tau, TE) = \frac{|\{(x, y) \in TE \mid y = cl_\tau(x)\}|}{|TE|}. \]

The extracted CQQL condition $e$ can now be presented to the user
and gives an understanding of the logical connection between input
and class decision. A condition in disjunctive normal form is often
hard to understand. Because the rules of the Boolean algebra hold
for our approach, the condition can be brought into another
syntactical form. If the number of attributes is small then the
Quine-McCluskey algorithm [7] may be applied to simplify
condition $e$. Notice that, because of having $2^n$ minterms, our approach has
exponential time and space complexity in the number of attributes.

Besides giving an understanding the condition should make
good predictions on the test set. It can happen that the described
approach suffers from overfitting. In that case, a more compact
condition may improve understanding and accuracy on test data.
A starting point for finding a more compact condition is to analyze
single minterms of the simplified disjunctive normal form and single
maxterms of the simplified conjunctive normal form. The next section
will demonstrate steps for finding a compact CQQL condition.

The derived formulas can be easily implemented. The algorithm
in Figure 3 implements Equation 2, the algorithm in Figure 4
implements Equation 10, the algorithm in Figure 5 implements Equation 4,
and the algorithm in Figure 6 implements Equation 11.


7 EXPERIMENT


We demonstrate the power of our quantum-logic-inspired classifier
by use of the classification problem PIMA. PIMA is a set of data
about diabetes among selected people of Pima Indian heritage. The data set $M$
contains values for the eight attributes: pregnancies (P), glucose
(G), blood pressure (BP), skin thickness (S), insulin (I), BMI,
diabetes pedigree function (D), and age (A) as well as the information
whether diabetes occurred or not. The challenge is to understand
the relation between the eight attributes and the occurrence of
diabetes and to enable a good diabetes prediction. PIMA contains data
about 768 people. However, the data is complete only for 392 people
($|M| = 392$). We partition $M$ into training data $TR$ with $|TR| = 330$
and test data $TE$ with $|TE| = 62$. For further processing the input
values of the eight attributes are mapped to the unit cube $[0, 1]^8$
using the density-function-based mapping.

Table 1: accuracy(e, τ, _) for different $\theta_\rho$-values ($\theta_\rho$ and accuracy in percent)

$\theta_\rho$ | $\tau$ | accuracy on TR | accuracy on TE | #minterms
50 | 0.61   | 80.6 | 77.41 | 112
55 | 0.522  | 80.3 | 74.19 | 89
60 | 0.474  | 80.3 | 80.64 | 72
65 | 0.396  | 80.3 | 79.03 | 57
70 | 0.242  | 80.3 | 77.41 | 39
75 | 0.167  | 80.6 | 74.19 | 24
80 | 0.066  | 79.7 | 69.35 |
85 | 0.0103 | 77.6 | 67.74 | 2

Applying Equation 9 to $TR$ yields the $\rho$-values for $2^8 = 256$
minterms. For selecting the best set of minterms we try out different
$\theta_\rho$ thresholds, compute the respective best output threshold $\tau$,
measure discrete accuracies on $TR$ and $TE$ and count the number
of active minterms, see Tab. 1. For the subsequent discussion, we
choose $\theta_\rho = 0.65$ and obtain $e$ in simplified disjunctive normal
form, see Tab. 2 left. In Tab. 2 right we see $e$ after its transformation
to the simplified conjunctive normal form.
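
The search for the best output threshold used in this sweep (Equation 12) can be sketched as follows. The function names and the toy evaluation results are hypothetical; `scores` stands for the values eval(e, x) of the training objects and `labels` for their known classes.

```python
def discrete_accuracy(scores, labels, tau):
    """Fraction of objects whose thresholded score equals the label."""
    hits = sum((1 if s > tau else 0) == y for s, y in zip(scores, labels))
    return hits / len(labels)


def best_threshold(scores, labels):
    """Output threshold following Equation (12)."""
    min1 = min(s for s, y in zip(scores, labels) if y == 1)
    max0 = max(s for s, y in zip(scores, labels) if y == 0)
    if max0 < min1:                        # classes are well separated
        return (max0 + min1) / 2
    # otherwise try every training score inside [min1, max0]
    candidates = [s for s in scores if min1 <= s <= max0]
    return max(candidates, key=lambda t: discrete_accuracy(scores, labels, t))


scores = [0.9, 0.4, 0.6, 0.5, 0.1]   # hypothetical evaluation results
labels = [1, 1, 0, 0, 0]
tau = best_threshold(scores, labels)
acc = discrete_accuracy(scores, labels, tau)
```

When several candidates reach the same accuracy, this sketch keeps the first one in training order; the text does not prescribe a tie-breaking rule.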

Formula e with discrete accuracy of 79% is very complex and 
the starting point for understanding the classifier. We search for 
simpler formulas f and are especially interested in two kinds of 
formulas when we regard e as condition in propositional logic: 


• f is necessary (e ⇒ f): All input data that are classified
by e as class members satisfy f. That is, with respect to e,
false rejections by f cannot occur and thus recall is 100%.

• f is sufficient (f ⇒ e): All input data that satisfy f also
satisfy e. That is, with respect to e, false positives of f cannot
occur and thus precision is 100%.


Please note that all minterms of e are sufficient formulas and all
maxterms are necessary formulas.

Table 2: Minterms (left) and maxterms (right) of e in dnf and cnf,
respectively; 0 stands for negated atom, 1 for non-negated atom and
an empty cell for no-care

minterms
P G BP S I BMI D A
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 0 1 1
1 0 1 1 1
1 1 0 1 1
1 1 1 1
0 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
(0) 1 1 1 1 1
0 1 0 1 1 1

In Table 3 we compare different formulas f with formula e. We
count the number of minterms of e ∧ f (11), e ∧ ¬f (01), ¬e ∧ f
(10), and ¬e ∧ ¬f (00) with 00 + 01 + 10 + 11 = 2^8 and compute the
respective values for precision, recall, and accuracy from the
numbers of shared minterms. At first, we try out every single
attribute. We see that glucose (G), with an accuracy of 69%, plays
an important role. Next, we try out attributes in combination. The
first three maxterms in Tab. 2 right are very compact and necessary
formulas. Therefore, recall is 100% but precision is low. In the
next section of Table 3, we test simple conjunctions of the three
attributes G, BMI, and A. They are neither necessary nor sufficient
but show good accuracy on minterms. In the last section, we combine
the three maxterms conjunctively into one formula and again obtain a
necessary formula with better accuracy than the single maxterms.
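The minterm-based comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the formula `e` below is a hypothetical stand-in (the real e is the disjunction of the minterms in Table 2), and the function and key names are ours, not part of the paper.

```python
from itertools import product

ATTRS = ("P", "G", "BP", "S", "I", "BMI", "D", "A")

def compare(e, f):
    """Enumerate all 2^8 minterms and compare formula f against formula e."""
    counts = {"both": 0, "only_f": 0, "only_e": 0, "neither": 0}
    for bits in product((False, True), repeat=len(ATTRS)):
        v = dict(zip(ATTRS, bits))          # one minterm = one truth assignment
        ev, fv = e(v), f(v)
        key = "both" if ev and fv else "only_f" if fv else "only_e" if ev else "neither"
        counts[key] += 1
    tp, fp, fn, tn = counts["both"], counts["only_f"], counts["only_e"], counts["neither"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return counts, precision, recall, accuracy

# Hypothetical stand-in for e, NOT the formula from Table 2.
def e(v):
    return v["G"] and (v["A"] or v["BMI"])

# Candidate formula f: the single attribute G.
def f(v):
    return v["G"]

counts, precision, recall, accuracy = compare(e, f)
is_necessary = counts["only_e"] == 0    # e => f: no minterm of e violates f
is_sufficient = counts["only_f"] == 0   # f => e: no minterm of f violates e
```

In this toy setting f = G is necessary for e (recall 1.0) but not sufficient (precision 0.75), which mirrors the behaviour reported for the compact maxterms.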

So far we have tested f against e based on the numbers of shared
minterms, but e itself shows an accuracy of 79% against the test
data. Therefore, we have to test the formulas f from Table 3 against
our data set M = TR ∪ TE and obtain Table 4. The measurements
differ from the measurements in Table 3 since for Table 4 new output
threshold values τ are chosen for each formula f by maximizing
accuracy³. Interestingly, the last formula G ∨ (A ∧ D ∧ I) is very
compact, and its accuracy is not less than that of e. It looks
as if we have not derived the optimal formula e. However, please be
aware that Equation 9 is based on a continuous accuracy whereas
in Table 4 discrete accuracy was calculated.
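Choosing an output threshold by maximizing discrete accuracy can be illustrated as follows. This is a minimal sketch, assuming that an object is classified as a class member when its continuous formula evaluation is at least τ; the scores and labels are invented for illustration and are not PIMA data.

```python
def best_threshold(scores, labels):
    """Return (tau, accuracy) maximizing discrete accuracy of 'score >= tau'."""
    # Only thresholds at observed scores (plus 'reject everything') can change the result.
    candidates = sorted(set(scores)) + [float("inf")]
    best_tau, best_acc = None, -1.0
    for tau in candidates:
        correct = sum((s >= tau) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:              # strict '>' keeps the smallest optimal tau
            best_tau, best_acc = tau, acc
    return best_tau, best_acc

# Hypothetical continuous evaluations of a formula f and the true class labels.
scores = [0.10, 0.35, 0.40, 0.62, 0.80, 0.90]
labels = [False, False, True, False, True, True]

tau, acc = best_threshold(scores, labels)
```

Raising τ above the accuracy-optimal value trades recall for precision, and lowering it does the opposite.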

In the literature⁴, the data set PIMA was used to build a logistic
regression classifier with an accuracy of 73%, an SVM classifier
with an accuracy of 75%, and a random forest classifier with an
accuracy of 88%. Only the last one shows a better accuracy than
G ∨ (A ∧ D ∧ I). However, our formula G ∨ (A ∧ D ∧ I) is very
compact and can be interpreted very easily: There is a high risk of
diabetes in case of a high plasma glucose concentration (G) or in
case of a high age (A) together with a strong diabetes pedigree (D)
and a high level of insulin (I).

³ We are free to modify τ values in order to improve precision at the expense of recall or vice versa.
⁴ https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

maxterms
P G BP S I BMI D A
1 1
1 1
1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
Ls, “fl 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 0
1 1 1 1
1 1 1 1
0 1 1 1
0 1 1 1
1 0 1 1
0 1 1 1
1 0 0 0 1


8 CONCLUSIONS 


In our paper we propose a classifier based on quantum logic. In
contrast to Boolean logic (input thresholds), quantum logic can
directly deal with continuous data, which is typically available in
many classification scenarios. Unlike fuzzy logic, quantum logic
is based on a sound theoretical foundation and, if some restrictions
are respected, it obeys the rules of Boolean algebra. Our
quantum-logic-inspired classifier is a tool to interpret the mapping
between classification input and output by means of logic. The
diabetes experiment demonstrates a good quality of prediction and of
explaining the relation between input and class decision. We used
rules of Boolean algebra to treat CQQL conditions in the same way
as propositional logic in order to find a compact condition with
good prediction quality that is easy to understand.

However, our approach is based on the disjunctive normal form
where the number of minterms explodes with the number of
attributes. Thus, we are restricted to classification problems with
a small number of attributes.

Ingo Schmitt

Table 3: Minterm comparisons of formulas f against formula e derived from θ_p = 0.65

f                  precision  recall  accuracy   00   01   10   11
G                         40      89        67  122    6   77   51
BMI                       31      70        59  111   17   88   40
A                         35      79        63  116   12   83   45
D                         31      70        59  111   17   88   40
P                         27      60        54  105   23   94   34
BP                        26      58        54  104   24   95   33
S                         28      63        56  107   21   92   36
I                         33      74        61  113   15   86   42
A ∨ G                     30     100        47   64    0  135   57
D ∨ G                     30     100        47   64    0  135   57
I ∨ G                     30     100        47   64    0  135   57
G ∧ A ∧ BMI               78      44        85  192   32    7   25
G ∧ A                     61      68        83  174   18   25   39
G ∧ BMI                   55      61        80  170   22   29   35
A ∧ BMI                   47      53        76  165   27   34   30
G ∨ (A ∧ D ∧ I)           40     100        66  112    0   87   57


Table 4: Evaluation of formulas f against TR ∪ TE with new thresholds

f                  precision  recall  accuracy   00   01   10   11
G                         81      42        77  249   76   13   54
BMI                       53      19        68  240  105   22   25
A                         64      36        72  235   83   27   47
D                         61      15        69  249  110   13   20
P                         70      32        73  244   88   18   42
BP                        66      15        69  252  111   10   19
S                         53      23        68  235  100   27   30
I                         53      66        69  185   44   77   86
A ∨ G                     73      59        79  234   53   28   77
D ∨ G                     77      42        77  246   75   16   55
I ∨ G                     69      42        74  237   75   25   55
G ∧ A ∧ BMI               75      48        77  241   68   21   62
G ∧ A                     72      55        78  235   58   27   72
G ∧ BMI                   69      52        77  232   62   30   68
A ∧ BMI                   64      35        72  237   85   25   45
G ∨ (A ∧ D ∧ I)           77      58        80  240   55   22   75


In future work, we will design algorithms that do not suffer from
the exponential number of minterms. Furthermore, our approach
has to be tested against more classification scenarios.


Abstract


The current pandemic has led to increased use of online 
learning and calls for innovative self-learning techniques. 
Since contact with educators is limited, students are required 
to become more self-reliant. This endeavour includes using 
self-assessment tools to measure the learning progress and 
uncover the areas for further studies. In this paper, we focus 
on a system that automatically generates various types of 
questions from the recommended course material. The sys- 
tem applies the most recent machine learning techniques, 
such as transfer learning, natural language generation, and
semantic similarity measures. We propose a human-in-the-loop
approach in which the instructor can provide guidance. Our system
would help students to calibrate themselves in a typical remote
learning environment.


CCS Concepts: • Computing methodologies → Machine
learning; • Information systems → Data management
systems; • Human-centered computing → Visualization.


Keywords: Multiple-Choice Question (MCQ), Study quiz,
NLP, NLG


1 Introduction


CrsMgr [34] is an online system that has been used for over a dozen
years to support the administrative and pedagogical processes of
university-level teaching. It has also been used to transform
traditional "final" exams into multiple online quizzes that combine
multiple-choice questions (MCQ) with traditional written answers to
text questions, which can be submitted to the system. This shift in
examination practices is related to the need for prompt
examinations, which avoids study delays until the end of the course.
Online assessments have become popular among educational
institutions, proficiency examiners, and corporations. As a result
of the pandemic, most university courses moved online, and online
evaluation became necessary. One of the issues we discovered was
that many students were not used to online quizzes and did not do
well. Moreover, the topics for the tests were limited to the most
recent lessons. We realized that CrsMgr's online quiz functionality
could easily be expanded to reflect the synchronous nature of the
online classes.

The motivation of this project is to create on-demand 
practice quizzes for the students using the required texts and 
class notes. The generated questions are an amalgamation 
of multiple-choice questions with one or several correct an- 
swers, true or false questions, fill-in-the-blanks questions 
by choosing appropriate values, and interrogative questions 
extracted from the learning material. The application uses 
natural language processing (NLP) and machine learning 
(ML) techniques. The input source consists of course materials
and class notes posted on CrsMgr. Additional inputs are
the source texts of textbooks written by one of the authors. These
textbooks are copyrighted and available from the university
library and other sources [24, 25]. Furthermore, our system
aims to automatically evaluate and provide feedback on the 
test results, thus helping the students prepare for the real 
quizzes. 


2 Present Developments 


The manual creation of multiple-choice questions for on- 
line quizzes is a cumbersome and time-consuming task for 
the instructors. On the other hand, the automated genera- 
tion of questions and answers is challenging and extremely 
broad, as it involves solving several sub-problems. Many
researchers and leading organizations are working on online
question-answering and assessment applications.

A myriad of online assessment tools, commercial or free,
try to provide useful features for online training. Socrative is
an interactive student response system that empowers teachers
to engage their classrooms through quizzes and polls
via smartphones and tablets [13]. ProProfs (Quiz Maker) is
another online quiz creation app where users can select from
a large library of questions organized by topic or upload the
questions from an Excel file [10]. Google Forms can also help
users manually create quizzes or select a quiz template from
many public templates. The advantage is that it works with
other Google Apps, so the quiz can be sent to students by
email or embedded in a Google Site. ClassMarker is a
secure, professional web-based quiz maker that includes instant
grading and saves hours of paperwork [1]. Another online
application that provides pre-built quiz questions is ThatQuiz
[16]. It has built-in quizzes for math, science, language arts,
and social studies, which are adjustable in difficulty and
length. ExamTime Quizzes [3], Testmoz [15], Gnowledge
[4], Online Quiz Creator [9], GoToQuiz [5], QuizStar [12],
Survey Anyplace [14], Mentimeter [8], and Edmodo [2] are
some more of the many online quiz creation applications.

Although these systems are helpful with the entire examination
process, including the random selection of questions, the timing,
and the instant grading, the user still has to create the questions
manually. There are some options to upload questions using Excel
templates or to select them from a library of pre-built questions,
but it is still a time-consuming process.


3 CrsMgr Quiz Generation Sub-System 


Our quiz generation system is an extension of CrsMgr. Questions
are generated from the passages the instructors select from the
course materials (lecture notes or textbook excerpts). The
instructor also has the option to upload the corresponding PDF
file. The generated questions can be classified into multiple
choice (MCQ), true/false, fill-in-the-blanks, and questions with an
interrogative wh-word (What/Who/When/Why/How). The application
uses machine learning models to generate the questions, the
corresponding correct answers, and the incorrect answers, known as
distractors. The resulting question-answer sets are then stored in
a database as a question bank for future use. Unlike some of
the systems that rely on pre-built questions, our subsystem for
CrsMgr generates the quiz automatically based on the course
material. Students can use the same portal for self-learning
and all other activities related to the course; they do not need
to switch to a different portal.


Figure 1. The architecture of CrsMgr for generation of practice
quiz. (Diagram labels: User; Text passage; Training dataset;
Summary of the passage; Keyword extraction; Generates questions and
answers and stores QnA in DB.)


Figure 1 explains the quiz generation system in detail.
In the first step, the user enters the passage in the text area
to generate the questions. Then, the application creates a
summary of the text and extracts the keywords from the summary.
The application then selects the sentences that the keywords
represent and generates the questions. Along with the correct
answer, the incorrect answers are also generated. The application
uses WordNet [27], a large lexical database that groups
semantically similar words, combined with NLP techniques to
produce the distractors.


3.1 Multiple Choice Questions (MCQ) 


MCQ is a form of objective assessment in which respondents 
are asked to select the correct answer from the choices of- 
fered. The multiple-choice format is the most frequently used 
in educational testing. Edward Lee Thorndike worked on an 
early scientific approach to test students, and his assistant 
Benjamin D. Wood developed the multiple-choice test [17]. 

In the mid-20th century, MCQ’s popularity increased. Mul- 
tiple choice question types are chosen because they are af- 
fordable for testing many students. The drawback is that stu- 
dents can take a guess when answering the question rather 
than genuinely understand the concept. Despite all the flaws, 
MCQs are popular because they are easy to create, score and 
analyze [32]. 

MCQs include three areas of concern: the question sen- 
tence, the correct answer and the distractors. The number of 
distractors is a parameter and can be determined by the user. 
The application will generate a question, the correct answer, 
and the specified number of incorrect answers. The question 
sentence can be a fill-in-the-blanks or a Wh-question, so 
these types of questions belong to the category of multiple- 
choice questions. 
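The three parts of an MCQ and the user-defined number of distractors can be captured in a small data structure. The following is a minimal sketch, not the CrsMgr implementation: the class `MCQ`, the function names, and the candidate pool are hypothetical, and the case-insensitive de-duplication is our own assumption about what a sensible assembly step might do.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class MCQ:
    stem: str            # the question sentence
    options: tuple       # correct answer and distractors, shuffled
    answer_index: int    # position of the correct answer in options

def build_mcq(stem, correct, distractor_pool, n_distractors=3, rng=None):
    """Assemble one MCQ with the requested number of distinct distractors."""
    rng = rng or random.Random()
    seen = {correct.lower()}
    candidates = []
    for d in distractor_pool:
        # Skip duplicates and anything equal to the correct answer (ignoring case).
        if d.lower() not in seen:
            seen.add(d.lower())
            candidates.append(d)
    if len(candidates) < n_distractors:
        raise ValueError("not enough distinct distractors")
    options = rng.sample(candidates, n_distractors) + [correct]
    rng.shuffle(options)
    return MCQ(stem, tuple(options), options.index(correct))

def is_correct(question, chosen_index):
    """Automatic grading: compare the chosen option with the stored answer."""
    return chosen_index == question.answer_index

q = build_mcq(
    "____ is a constraint (restriction) on the value of data in a column of a table.",
    "Domain",
    ["Group", "Field", "Diagonal", "domain", "Field"],   # hypothetical pool
    n_distractors=3,
    rng=random.Random(0),
)
```

Filtering the pool before sampling avoids option lists in which the correct answer appears twice with different capitalization.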


3.2 True/False Questions 


True/False questions are also known as polar or general
questions [21]. They have only two possible answers, either
"True" or "False". However, it could also be "Yes" or "No",
"Agree" or "Disagree", or any other suitable pair of mutually
exclusive responses. One major drawback of True/False questions
is that the learner has a 50% chance of choosing the correct
answer by guessing, which could be inadequate for testing the
actual knowledge. Nevertheless, many educational institutions and
organizations use these types of questions during assessments.
Such questions can be phrased in a tricky way to challenge
the learner and thereby test his or her understanding.
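The effect of the 50% guessing chance can be quantified with a short binomial calculation. This is an illustrative sketch only; the quiz length and pass mark are hypothetical and not taken from CrsMgr.

```python
from math import comb

def p_pass_by_guessing(n_questions, n_needed, p_correct=0.5):
    """Probability of answering at least n_needed of n_questions correctly by pure guessing."""
    return sum(
        comb(n_questions, k) * p_correct**k * (1 - p_correct) ** (n_questions - k)
        for k in range(n_needed, n_questions + 1)
    )

# Hypothetical quiz: 10 questions, 6 correct answers needed to pass.
p_true_false = p_pass_by_guessing(10, 6, p_correct=0.5)    # two options per question
p_four_options = p_pass_by_guessing(10, 6, p_correct=0.25)  # four options per question
```

Under these assumptions, a student who guesses every answer passes a True/False quiz in roughly 38% of attempts, but a four-option MCQ quiz in only about 2%.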


3.3 Fill-in-the-blanks Questions 


A fill-in-the-blanks question consists of a sentence with a
blank space where the student fills in the missing word. This
type of question can also be combined with multiple choice and
is easy to evaluate automatically. One well-known test using
fill-in-the-blank questions is the cloze test, also called the cloze
deletion test or the occlusion test. A cloze test is an exercise,
test, or assessment consisting of a portion of language with
certain items, words, or signs removed, where the participant
is asked to replace the missing language item [33].
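A cloze item can be derived from a sentence and a keyword with a few lines of Python. The sketch below is an assumption about how such a step could look, not the CrsMgr code; the function name and the blank marker are hypothetical.

```python
import re

def make_cloze(sentence, keyword, blank="____"):
    """Blank out the first whole-word occurrence of keyword; return (stem, answer) or None."""
    # Word boundaries prevent matching inside longer words such as 'database'.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    match = pattern.search(sentence)
    if match is None:
        return None
    stem = sentence[: match.start()] + blank + sentence[match.end():]
    return stem, match.group(0)

result = make_cloze(
    "Conceptually, a database is a container and the data is the contents of the container.",
    "data",
)
```

The removed word is returned alongside the stem, so it can serve directly as the correct answer of a multiple-choice variant.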

In our subsystem of CrsMgr, the different types of questions
are mixed, aiming for diverse and engaging quizzes.




An Online MCQ sub-system for CrsMgr 


3.4 Datasets 


To train our models, we need a labelled dataset of passages
together with possible questions and answers retrieved from each
passage. Many datasets such as SQuAD, 30MQA, MS MARCO, RACE,
NewsQA, TriviaQA, TabMCQ, SciQ and NarrativeQA contain
question-answer sets and are used for training question-answering
machine learning models [23]. These datasets contain millions of
rows with data sampled from Wikipedia, web searches or
crowdsourcing. They were mainly collected for reading comprehension
tasks, which makes them good candidates, as our goal is also to
evaluate the students' comprehension of the study material.

Other datasets like MCQL are intended for automatic distractor
generation [28]. Another well-designed dataset is
LearningQ which covers a wide range of learning subjects 
and contains a large set of document-question pairs and 
multiple source sentences for question generation [20]. 

In this subsystem of CrsMgr, the following datasets are 
employed to train the model: BoolQ [22], SQuAD [30], CoQA
[31], MS MARCO [18]. Afterwards, the application uses the 
text passage that the instructor submits as an input source 
to generate the quiz questions. 


4 Challenges Of MCQ Generation 


The main challenge is to generate quality questions compa- 
rable to the questions a human teacher would create. The 
contextual language models we have applied significantly 
improve the results compared to previous rule-based or
recurrent-network-based approaches. We found that addi- 
tional pre-processing steps like summarizing the text, simpli- 
fying the sentences, and replacing the pronouns can improve 
the questions. 

Another crucial challenge is the semantic gap. In a natural
language, the same meaning can be expressed differently,
and the same phrase can have different meanings. The application
sometimes fails to identify the intended sense of such words. For
example, assume we need to find distractors for the word "tree" in
the context of the computer science domain. In NLP, it is
challenging to determine the context of a word when only that
single word is given. Therefore, the system might generate
distractors from an irrelevant context. As a result, the student
could easily guess the answer, because the wrong options refer to
different domains or contexts.

Additional issues are answers that deviate from the original
question and unclear questions. Sometimes it is difficult for the
application to identify the correct keyword to be used to generate
the question. If the application generates a vague or ambiguous
question, it might lead to multiple valid answers and thus make it
difficult to validate the student's comprehension.

Another problem is the generation of questions that are
too easy for the user. Adding one more layer to the model that
classifies the question's difficulty level would help solve
this issue.

In Figure 2, the example question depicts one of the challenges
of online question creation. Here the correct answer is
'Instance', and both question and answer belong to the computer
science domain. CrsMgr uses WordNet to generate the distractors
(incorrect answers) for the given question, and the generated
distractors are not computer-science-related words. WordNet took
the word 'Instance' and looked up its synonyms without considering
the domain. The generated incorrect answers therefore deviate from
the question topic, which makes it easy for the student to guess
the answer and difficult to test the actual knowledge.


The word ____ in this context means occurrence.
1) Appearance
2) Instance
3) Accompaniment
4) Accident


Figure 2. Question generated by CrsMgr illustrating the 
challenges of distractor generation. 


5 Current Status Of CrsMgr-MCQ Generation


In this subsystem of CrsMgr, we generate multiple-choice
questions (MCQ) from the course material. The instructor
has two options: either select a passage that will be used to
generate the questions or upload the corresponding PDF file.
Currently, the system generates fill-in-the-blanks and
wh-questions with multiple-choice answers as well as True/False
questions. Figure 3 shows some of the questions that
the current subsystem of CrsMgr has generated.


5) A relational database consists of one or more tables which are containers for the ____.
   1) Ana
   2) Armamentarium
   3) Data
   4) Agglomeration
6) Conceptually, a database is a container and the ____ is the contents of the container.
   1) Data
   2) Armamentarium
   3) Agglomeration
   4) Ana
7) Users input ____ by means of the keyboard or mouse and receive data from the monitor.
   1) Armamentarium
   2) Agglomeration
   3) Data
   4) Ana
8) A domain is a constraint (restriction) on the value of ____ in a column of a table.
   1) Agglomeration
   2) Data
   3) Ana
   4) Armamentarium
9) A ____ consists of one or more tables which are containers for the data.
   1) Relational database
   2) Relational Database
   3) Object-oriented Database
   4) Lexical Database
10) ____ is a constraint (restriction) on the value of data in a column of a table.
   1) Group
   2) Field
   3) Domain
   4) Diagonal
11) An example of a ____ is a string with a maximum number of characters.
   1) Diagonal
   2) Field
   3) Group
   4) Domain


Figure 3. Sample questions generated by CrsMgr.


The instructor can add, delete or modify the questions and
the multiple-choice options and then save the result to the
database, after which the quiz becomes available to students
for untimed self-practice.

In order to generate the questions, we use the Summa
Python library [19] for extractive summarization. The
library selects the most important sentences in a document
or text. We define the summary length as a proportion of the
text, and the program considers only mid-length sentences
for generating the questions. This pre-processing aims to
help the model generate relevant and meaningful questions.
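The mid-length sentence filter applied after summarization can be sketched with the standard library alone. The word-count bounds and the function name below are hypothetical, since the paper does not state the actual limits; the summary itself would come from Summa (to our knowledge via `summarizer.summarize(text, ratio=...)`), which is only mentioned in a comment here so that the sketch stays self-contained.

```python
import re

def mid_length_sentences(summary, min_words=8, max_words=30):
    """Keep only sentences whose word count lies within [min_words, max_words]."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # a real pipeline would likely use a proper sentence tokenizer).
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]

# In the described pipeline, `summary` would be the output of the Summa library;
# here a hand-written stand-in is used.
summary = (
    "Data matters. "
    "A relational database consists of one or more tables which are containers for the data. "
    "Short one here."
)
kept = mid_length_sentences(summary)
```

Very short sentences rarely carry enough context for a question, and very long ones tend to yield unwieldy stems, which is presumably why only mid-length sentences are used.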

PDF (Portable Document Format) files are made up
of text, vector graphics, raster graphics, and multimedia.
The PyMuPDF Python library [11] is used to read the PDF
file. Unlike HTML, where each piece of text carries a tag such as
header or paragraph, the text read by PyMuPDF has no tags or
structure that would identify the passage text. The program
therefore extracts font style information, such as the font sizes
and the most used fonts, sorted by count and colour. The HTML tags
are then assigned based on the number of words that share the same
font size.

The font size with the highest number of words is labelled
as paragraph text. The program considers the sentences
marked as paragraph text for the extractive summarization.
The PDF documents are opened page by page, the
HTML tags are identified, and the summary is generated.
However, this approach does not give a perfect result. If the
PDF page contains program code or a table, the program will
fail to identify the paragraph text.
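The font-size heuristic can be illustrated on plain dictionaries. The sketch assumes spans shaped like those PyMuPDF returns from `page.get_text("dict")` (each span has a `"size"` and a `"text"` entry); the tag names and the rule that larger sizes are headers and smaller sizes are other text are our own assumptions, as the paper only states how paragraph text is identified.

```python
from collections import Counter

def label_spans(spans):
    """Tag each span as paragraph ('p'), header ('h') or small text ('s') by font size."""
    words_per_size = Counter()
    for span in spans:
        words_per_size[round(span["size"], 1)] += len(span["text"].split())
    if not words_per_size:
        return []
    # The font size that carries the most words is taken to be the body text.
    body_size = words_per_size.most_common(1)[0][0]
    labelled = []
    for span in spans:
        size = round(span["size"], 1)
        tag = "p" if size == body_size else "h" if size > body_size else "s"
        labelled.append((tag, span["text"]))
    return labelled

# Hypothetical spans of one page; real input would be collected from PyMuPDF.
spans = [
    {"size": 18.0, "text": "3 Databases"},
    {"size": 10.0, "text": "A relational database consists of tables."},
    {"size": 10.0, "text": "Tables hold the data."},
    {"size": 8.0, "text": "Page 12"},
]
labelled = label_spans(spans)
paragraph_text = " ".join(text for tag, text in labelled if tag == "p")
```

A page dominated by a code listing or a table would shift the word counts towards a different font size, which is consistent with the failure case described above.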

In addition to unit testing and integration testing, users 
were able to test the functionality during the Winter session 
of 2022 in a real-life scenario. Enrolled students were able 
to try out the self-study quiz in our web application. The
results of the assessments are logged without resorting to
privacy-violating monitoring tools.


6 Exploring Transformer Models 


We conducted additional experiments with some Transformer 
models [6] published by Google as they perform well across 
a wide variety of NLP tasks. We were interested in how 
these models can help solve our challenges in generating 
quizzes from textbooks. These models are worth considering 
because they work with representations of words within 
their context. Hence, these models excel at matching phrases
that are semantically related even when they do not match exactly.
This characteristic is helpful for the reading comprehension task.


6.1 Methodology 


We selected models based on BERT [26] and T5 transformers
[29] that were fine-tuned on different datasets. These models were
trained with the goal of generating reading-comprehension-style
questions with answers extracted from the source text.
The training datasets include a paragraph, corresponding
questions, and answers. The purpose of this data format is
to fine-tune the models to generate questions and predict
answers. We have considered different kinds of models. Some
models only require the paragraph as input and can generate
questions or question-answer pairs. Other models expect
both the paragraph and the answer and output the question.
These models can be used in combination with other
methods that identify keywords, so that questions are asked about
those keywords. In a third approach, we ask
the model to predict the answer for a given paragraph and
question. Detailed results of our experiments can be found
in our notebook [7].


6.2 Experiment: The role of the training dataset 


This experiment aims to verify if the training dataset plays 
a role in the quality of the questions that can be generated. 
We have selected four models based on T5 transformers 
architecture and fine-tuned on different datasets. As we can 
see in Figure 4, the style of the generated questions depends 
on the dataset on which the model has been fine-tuned. Some 
models have the capacity to generate both factual and yes/no 
questions. This is the case of Model 2, which has been trained 
on several datasets. 


T5 Transformers


Passage from course material: 


“Data structures serve as the basis for abstract data types. The abstract data type defines the 
logical form of the data type. The data structure implements the physical form of the data type. 
Different types of data structures are suited to different kinds of applications, and some are 
highly specialized to specific tasks. For example, relational databases commonly use B-tree 
indexes for data retrieval, while compiler implementations usually use hash tables to look up 
identifiers. Data structures provide a means to manage large amounts of data efficiently for 
uses such as large databases and internet indexing services. Usually, efficient data structures 
are key to designing efficient algorithms. Some formal design methods and programming 
languages emphasize data structures, rather than algorithms, as the key organizing factor in 
software design. Data structures can be used to organize the storage and retrieval of 
information stored in both main memory and secondary memory.” 

Model 1: allenai/t5-small-squad2-question-generation
(fine-tuned on SQuAD 2.0)

Model 2: iarfmoose/t5-base-question-generator
(fine-tuned on SQuAD, CoQA, and MS MARCO)


Figure 4. Questions generated by two T5-based models fine- 
tuned on different datasets. 


6.3 Experiment: The role of the decoder 


The purpose of the second experiment was to observe how
the decoding method can help us improve the quality and the
variety of the questions. In Figure 5, we show the questions
generated by the same model but using different decoding
methods.

We have tried the following decoding methods:

• Greedy search

• Beam search

• Top-K sampling

• Top-P sampling

We have observed that the decoding strategy can influence
the style of the generated questions, as it decides
which tokens are selected as output. For example, if we use
greedy decoding, we can get stuck in the same sets of words
because the greedy algorithm always selects the token with
the highest probability. Beam search, on the other hand,
introduces more variety in the output, as it keeps the most
likely hypotheses at each step and eventually chooses the
hypothesis with the overall highest probability. This way, we
include some word sequences in the output that greedy
search might miss. In Top-K sampling, the next word is sampled
from the K most likely candidates only; depending on the
distribution and the value of K, we can still end up selecting very
unlikely words. In Top-P sampling (or nucleus sampling),
instead of sampling only from the K most likely words, we
sample from the smallest set of words whose cumulative
probability exceeds the probability P. In conclusion, the
different decoding methods introduce more variety in the
output questions because human language does not always
select the highest-probability words as greedy search does.
Based on our observations, we would recommend beam
search decoding because it introduces some variety and, at
the same time, produces more accurate results than the
Top-PK (combined Top-K and Top-P) sampling strategy.
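The four decoding strategies can be contrasted on a toy example. The sketch below is not the code used in the experiments: the probability tables `NEXT` and `TOY_MODEL` are invented, and a real system would obtain the distributions from the language model at each step.

```python
import math

# Hypothetical next-token distribution for a single decoding step.
NEXT = {"what": 0.50, "which": 0.30, "how": 0.15, "why": 0.05}

def greedy_pick(probs):
    """Greedy decoding: always take the single most likely token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Top-K: keep the k most likely tokens and renormalise before sampling."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Top-P: keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

# Hypothetical two-step model: prefix -> distribution over the next token.
TOY_MODEL = {
    ("<s>",): {"What": 0.6, "Which": 0.4},
    ("<s>", "What"): {"is": 0.55, "are": 0.45},
    ("<s>", "Which"): {"structures": 0.9, "data": 0.1},
}

def beam_search(model, start, steps, width):
    """Keep the `width` best partial sequences per step; width=1 equals greedy search."""
    beams = [(0.0, (start,))]
    for _ in range(steps):
        expanded = [
            (logp + math.log(p), seq + (tok,))
            for logp, seq in beams
            for tok, p in model[seq].items()
        ]
        beams = sorted(expanded, reverse=True)[:width]
    return beams[0][1]

greedy_sequence = beam_search(TOY_MODEL, "<s>", steps=2, width=1)
beam_sequence = beam_search(TOY_MODEL, "<s>", steps=2, width=2)
```

In this toy model, greedy search commits to "What" and ends with "What is" (probability 0.33), whereas a beam of width 2 also follows "Which" and finds "Which structures" (probability 0.36), a sequence that greedy search misses.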


Text 1: Different types of data structures are suited to
different kinds of applications, and some are highly
specialized to specific tasks.

Greedy decoding:
[What are some data structures highly specialized to?]
[What are different types of data structures suited to?]
[What type of data structures are suited to different kinds of applications?]

Top-PK decoding:
[What is one of the most specialized services of data structures?]
[What are some of the data structures highly specialized to?]
[What are different types of data structures suited to?]

Figure 5. Questions generated by the same model using 
different decoding methods. 


6.4 Experiment: BERT for question answering 


We applied the model BertForQuestionAnswering, which
is based on the BERT architecture [26] and fine-tuned on the SQuAD
dataset [30]. We asked the model to generate answers
for three different types of text: one paragraph selected
from a textbook, one from a video transcript of a lecture, and
one from a Wikipedia page. We observed that the model
worked very well on the Wikipedia paragraph, followed by the
textbook paragraph, while the performance was worst on the
video transcript. This observation brings us back to the problem
of choosing an appropriate training dataset to help the model
learn about the specific domain. Most models are trained on
Wikipedia text; however, they have difficulties generalizing to
the more complex text found in textbooks or to unstructured text
like video transcripts.


7 Results 


In summary, we have created a sub-system of CrsMgr able 
to generate MCQ quizzes from selected course materials. 
We have encountered several challenges and explored how 
different language models could help us in the quiz genera- 
tion task. We focused on models based on GPT-2, BERT and 
T5 text-to-text architectures. We did not have time to complete
our studies of other models like GPT-3, which could
further improve the quality of the question-answer sets.
Nonetheless, based on our findings, we understand the pos- 
sible factors that could help us build a model capable of 
generating quality quizzes. These factors are: 


1. Select appropriate training datasets related to the course- 
work material to fine-tune the models on the specific 
task of quiz generation in the educational context. 

2. Apply beam search as a decoding technique. 

3. Assess and eliminate low-quality question-answer
pairs.

4. Pre-process the paragraph appropriately to facilitate 
the quiz generation. 

5. Find better methods to generate distractors. 


Figure 6. Selected answers generated by BERT model. 


We observed that each model has its advantages and dis- 
advantages. Hence, we need to combine different models to 
obtain a wider variety of questions and more accurate an- 
swers. Another issue is that the training datasets are mostly 
sourced from Wikipedia, but in the educational context, the 
course materials could be much more complex. For this reason,
we may try other datasets better suited to generating
multiple-choice questions in an educational context.


8 Conclusions and Future Work


This paper describes work in progress on building a subsystem
of CrsMgr for quiz generation. We have tried to adapt
some of the existing question-answering and question
generation language models, and the results are promising.

However, there is scope for further improvement in the 
quality of the questions and the answers. Although the pre- 
trained machine learning models are abundant, they are not 
specialized in generating multichoice quizzes in an educa- 
tional context. Further work is needed to determine which 
models or combinations of models are best for this task. 

That is why we propose the following areas for future 
work: 

Dataset: Our approach is data-driven, but the datasets
we have used so far are open-source, large-scale
question-answer datasets, extracted mainly from Wikipedia and
annotated by humans. When dealing with specific
professional domains, the corpus data is more complex than
Wikipedia articles. Therefore, we can further fine-tune our
models on domain-specific datasets to improve their
relevance and accuracy.

In addition, we can apply different pre-processing tech- 
niques to facilitate the work of the model. So far, we have 
used the summarization technique so that the model can 
focus on the essential parts of the text. We can apply other 
data augmentation techniques like named entity resolution, 
replacing pronouns, noun phrases and verb phrases, simpli- 
fying the sentences, paraphrasing and synonym replacement. 
The goal is to select informative sentences and words to gen- 
erate the question-answer pairs. We can also better handle 
unknown words and specific terminology by providing the 
model with a domain-specific dictionary. 

Model: Our goal is to generate diverse and relevant
question-answer pairs. We plan to explore different encoding and
decoding techniques to improve the variety of our quizzes. So
far, we have confirmed that beam search decoding introduces
more variety in the generated questions than greedy
decoding. At each step, beam search keeps several highly
probable candidates, as many as the beam size, whereas greedy
search keeps only the single most probable word. We can
explore other decoding techniques like Monte Carlo Tree
Search or Value Guided Beam Search. These decoding methods
might introduce more variety and bring the output questions
closer to human language.
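
The difference between the two decoding strategies can be illustrated with a minimal sketch. The next-token table `TOY_MODEL` below is entirely hypothetical and stands in for a real language model; the function names are ours and are not part of CrsMgr.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>":   {"what": 0.6, "which": 0.4},
    "what":  {"is": 0.5, "does": 0.5},
    "which": {"model": 0.9, "is": 0.1},
    "is":    {"</s>": 1.0},
    "does":  {"</s>": 1.0},
    "model": {"</s>": 1.0},
}

def greedy_decode(model, start="<s>", end="</s>", max_len=5):
    """Keep only the single most probable next token at every step."""
    seq, logp = [start], 0.0
    while seq[-1] != end and len(seq) < max_len:
        token, p = max(model[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq, logp

def beam_search(model, beam_size=2, start="<s>", end="</s>", max_len=5):
    """Keep the beam_size most probable partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == end:              # finished sequences are carried over
                candidates.append((seq, logp))
                continue
            for token, p in model[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams

greedy_seq, greedy_logp = greedy_decode(TOY_MODEL)
beams = beam_search(TOY_MODEL, beam_size=2)
```

On this toy table, greedy decoding commits to "what" at the first step, while beam search also keeps "which" alive and ends up returning two distinct candidate questions, one of which has a higher overall probability than the greedy output.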

Another challenge we face is to ensure that each question, the
corresponding answer and the distractors are semantically
consistent. We could apply techniques like nearest
neighbours or distance measurements to ensure the semantic
similarity of the generated distractors.
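
As a minimal sketch of the nearest-neighbour idea, candidate distractors can be ranked by cosine similarity to the correct answer. The three-dimensional vectors below are made-up placeholders; in practice they would come from a word or sentence embedding model, which is an assumption on our side.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_distractors(answer, candidates, embeddings, k=2):
    """Return the k candidates whose embeddings are closest to the answer."""
    scored = [(cosine(embeddings[answer], embeddings[c]), c)
              for c in candidates if c != answer]
    return [c for _, c in sorted(scored, reverse=True)[:k]]

# Hypothetical embeddings, for illustration only.
EMBEDDINGS = {
    "Paris":  [0.90, 0.10, 0.00],
    "Berlin": [0.80, 0.20, 0.00],
    "Madrid": [0.85, 0.10, 0.05],
    "banana": [0.00, 0.10, 0.90],
}

distractors = rank_distractors("Paris", ["Berlin", "Madrid", "banana"], EMBEDDINGS)
```

Semantically unrelated candidates such as "banana" receive a low score and are not selected, so the remaining distractors stay plausible for the question.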

Evaluation: In some cases, our models may not give good 
results. We need to use the appropriate evaluation metrics. 
For example, question similarity calculation could exclude 
redundant questions. 
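
One simple way to sketch such a similarity check is token-level Jaccard similarity. The threshold of 0.7 and the whitespace tokenisation are arbitrary assumptions made for this illustration, not choices validated in our experiments.

```python
def jaccard(a, b):
    """Jaccard similarity between the lower-cased token sets of two questions."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def drop_redundant(questions, threshold=0.7):
    """Keep a question only if it is not too similar to an already kept one."""
    kept = []
    for question in questions:
        if all(jaccard(question, other) < threshold for other in kept):
            kept.append(question)
    return kept

questions = [
    "What is beam search decoding?",
    "What is the beam search decoding?",
    "Which dataset is used for fine-tuning?",
]
unique_questions = drop_redundant(questions)
```

Here the second question shares five of six tokens with the first one and is therefore discarded, while the third one is kept.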

Further experiments: We have explored several pre-trained
language models; however, there are many more that
could become a good base candidate. We plan to try other
models and perform hyper-parameter tuning to select the
best models. We can combine the best ones in an ensemble
to improve the outcome.

Finally, the automated generation of MCQ quizzes is a
challenging and extremely broad problem with important
practical applications. The final version of our system will
offer different types of questions to make self-learning
enjoyable and valuable. Learners can use this system as a mock
test when preparing for final examinations, and teachers can
use it to save time and provide more options for their
students. Overall, the teaching process will be more worthwhile
and efficient.


1 INTRODUCTION


Conformance checking is an integral part of Artificial Intelligence,
bridging data mining and business process management
[7]. It assesses whether a sequence of distinguishable events (i.e.,
a trace) conforms to the expected process behaviour represented
as a (process) model [22]. Each event is associated with both an
activity label describing the captured event and payload data,
either associated to the whole trace or to a specific event. When
multiple distinct traces are considered in a log, model checking
lists the traces satisfying the model [8]. Non-conforming traces are
usually referred to as deviant [7]. Declarative models are composed
of multiple human-readable clauses that should be jointly satisfied
(i.e., a conjunctive query) [14]; each of these is the instantiation of
a specific behavioural pattern (i.e., template) expressing temporal
correlations between actions being carried out, thus linking
preconditions to expected outcomes. Such correlations might also involve


Permission to make digital or hard copies of all or part of this work for personal or 
classroom use is granted without fee provided that copies are not made or distributed 
for profit or commercial advantage and that copies bear this notice and the full citation 
on the first page. Copyrights for components of this work owned by others than ACM 
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, 
to post on servers or to redistribute to lists, requires prior specific permission and/or a 
fee. Request permissions from permissions@acm.org. 

IDEAS’22, August 22-24, 2022, Budapest, Hungary 

© 2022 Association for Computing Machinery. 

ACM ISBN 978-1-4503-9709-4/22/08...$15.00
https://doi.org/10.1145/3548785.3548786 


IDEAS2022 


pressed as Finite State Machines [2, 15] but, by doing so, each state
represents a possible configuration the system might find
itself in, for which we need to describe all the reasonable actions
and data conditions. This makes graph-based data-aware model checking,
as in [7], rather inefficient, as the size of these graphs becomes
exponential with respect to the size of the original declarative model.
As a result, this increases the computational time required for
conformance checking. Such models are also incapable of expressing
Θ-correlation conditions on the data payload, thus limiting the
models’ expressiveness.

Conformance checking with declarative models is a well-studied
technology at the core of AI’s temporal decision making. Firstly,
conformance checking is adopted when mining a model from logs
either containing only positive (or negative) traces [21], or
containing both, but where positive traces can be discriminated
from the negative ones via behavioural or data conditions, thus
allowing the generation of both a positive and a negative model [6].
The example proposed in this paper (Figure 1) contains cancer
patient records obtained from a hospital (this data is included in our
datasets). In healthcare, individuals likely to suffer from an illness
should receive treatment, and those that are not suffering should
not. Therefore, cases where sufferers do not receive treatment (false
negatives) and non-sufferers receive treatment (false positives)
need to be minimised. Figure 1 proposes a simplified version of
this scenario. We consider 2 event payload labels: CA 15-3
(cancer antigen concentration in a patient’s blood), and biopsy
(biopsies should be taken before any procedure is acted upon). Our
model, which describes a medical protocol and the desired patients’
health condition at each step, targets only breast cancer patients
with successful therapies. The first clause states that the two possible
surgical operations for breast tumours are mastectomy or lumpectomy if
the biopsy is positive and the CA 15-3 level is way above (> 50) the
guard level of 23.5 units per ml; the remaining clauses state that any
successful treatment should decrease the CA 15-3 levels, which should
fall below the guard level; such a correlation data condition is
expressed via a Θ condition (also indicated as where). A twinned
negative model (not in Figure 1)
might better discriminate healthy patients from patients where
the therapy was unsuccessful. Secondly, conformance checking
can also exploit such models for predicting which novel clinical
situations represented as traces are likely to adhere to the expected
clinical standards. Novel situations can be represented as a log:
e.g., in Figure 1, we have three patients: ① a cancer patient with a
successful mastectomy, ② a healthy patient, and ③ a patient with an
unsuccessful lumpectomy, thus suggesting that the patient might still
have some cancerous cells. Given the aforementioned model, patient ① will
satisfy the model as the surgical operation was successful, ② will
not satisfy the model because neither mastectomy nor lumpectomy
was required (M is only fulfilled for successful procedures), and ③
will not satisfy the target condition, even though the correlation
condition was met. Our model of interest should only return ① as
an outcome of the conformance checking process.

Figure 1: KnoBAB Architecture for Breast Cancer patients. Each trace ①-③ represents a single patient’s clinical history and is
drawn in a unique colour. Please observe that the atomization process does not consider data distribution but rather
partitions the data space as described by the data activation and target conditions. In the query plan, green arrows mark
access to shared sub-queries as in [4], and thick red ellipses mark which operators are untimed.

Real business use case scenarios usually require Θ-correlations.
In a goods brokerage scenario [17], items are traded between
producers (vendors) and retailers (customers): each transaction starts
with a vendor sending a sales quotation to a customer. If an offer is
accepted and the order is confirmed, then the item is scheduled for
delivery. When ready, a logistic operator collects it. In this scenario,
deviant traces either do not reflect the company’s rules or will
potentially lead to retailers’ complaints: e.g., a late delivery complaint
can occur only if the date the product is received is later than
the agreed delivery date as registered in a previous agreement
event. This situation cannot be directly expressed as a temporal
pattern, as we also need to test the timestamps stored in the
data payload. Conformance checking can also be applied to several
unexplored non-business domains, such as smart contract verification
[10]. Most recent video games exploit AI features [13]: the existing
state of the art exploits automata [16] for modelling Non-Player
Characters’ behaviours. As declarative models and automata are
completely equivalent approaches, developers might exploit the
former to compactly represent the latter. Furthermore, as
debugging AI in video games is a crucial challenge [12], conformance
checking solutions might be exploited for debugging unexpected
behaviours. As AAA video games already track and log both player
and NPC actions¹, it might also be possible to use game logs for
distinguishing winning strategies from losing ones [6]. As a result,
the analysis of an ongoing trace at runtime might ‘suggest’ actions
beneficial to the player based on the game state.

Given that conformance checking is at the heart of both trace 
validation and model mining, it is of crucial relevance to optimize 
such a task. Solutions enabling conformance checking via model 
mining through SQL queries [23, 24] neither explicitly evaluate the 
satisfiability for every single trace, nor return the traces that satisfy 
them, but only associate support and confidence values to each 
of said clauses for model mining purposes. However, as shown in 
this paper, these queries can be extended to both evaluate satisfia- 
bility per trace and return the set of traces satisfying every single 
clause, thus adhering to the definition from conformance checking 
literature (§2). In doing so, we are forced to introduce aggrega- 
tion and nesting operations, which are not generally efficient. This 
fact is supported by experimental evidence (§5.1), where we also 


¹https://battlefieldtracker.com/



Running Temporal Logical Queries on the Relational Model 


extend the relational representation of traces from [23, 24] (§4.1).
Our specific contribution is then the provision of specific operators
(xtLTLf) rewriting existing LTLf operators for the relational model,
thus efficiently running conformance checking queries in Declare
(§3.2). This is also possible through a query plan solution similar
to [4] (§4.3), which proves to be more efficient than any solution
relying solely on the SQL language. The Query Plan (Figure 1)
utilises our proposed xtLTLf operators (§3.2) as an extension of
the traditional LTLf operators (§2) that logically define a Declare
clause (Table 1). While LTLf operators provide a formal logical
temporal definition, xtLTLf operators are designed to exploit the
benefits that a relational model provides. This includes optimized
access to the tables defined in Figure 1. For example, the proposed
operator Init, which constrains each trace in a log to begin with a
specified event, can directly access the CountingTable and exploit
offsets to determine the first event per trace. Traditional LTLf (used
for a relational model) would require an entire log scan.

Even state-of-the-art implementations explicitly engineered to
solve the conformance checking problem without relying on a
relational representation of traces are not particularly efficient [8].
This solution, not being able to assemble the previously described
LTLf operators within a query plan, can neither minimize the access
operations to the trace data nor minimize the re-computation of
sub-expressions that appear frequently in the model, as recently
proposed by [4]. This claim is also supported by analysing the query
plan for more recent approaches, where no evidence of query
optimization over the query plan is given [9, 20]. Further experimental
evidence supports such theoretical claims (§5.2): in the
first instance, it shows that our solution is already more efficient
than the state of the art in the literature by two to three orders of
magnitude (hundredths/tenths of a millisecond vs. tens of seconds).
Furthermore, by using different Declare models composed of several
clauses accessing the same activation and target conditions, except
for the data correlations, our solution exhibits an increase in running
time only when new data is accessed and, otherwise, it preserves a
constant running time with fewer temporal fluctuations.


Contributions. Our proposed solution is then implemented in
KnoBAB: we synthesise logs derived from a system (be it
digital or real) into a column-store knowledge base implemented
ad hoc for conformance checking (§4.1). We then
generate a conformance checking query plan from a
declarative model (§4.2), be it positive or negative, so as to compute
desired properties associated with non-deviant traces (§4.3). As per
previous remarks, declarative models represent temporal and data
constraints that one would expect to hold as true in the non-deviant
traces from the twinned system. As such, one can consider those
traces returned by the query associated with the declarative model
as correct, and the remainder as deviant. As a temporal
representation of the declarative model provides a point-of-relativity in the
context of correctness (i.e., time itself may dictate if traces maintain
correctness throughout the unfolding of the associated events), the
consideration of such temporal issues significantly increases the
time spent checking whether the requirements are met. Our
contributions include: (i) an extension of the log representation from
[23, 24] with a CountingTable and a column-based relational model
for representing data payloads (§4.1, upper part of Figure 1), (ii) a
query compiler (§4.2) transforming each Declare model into a DAG
query plan (lower part of Figure 1), (iii) a relational formulation of
traditional LTLf as xtLTLf, and (iv) an execution engine running
the DAG either sequentially or in parallel (§4.3).


2 RELATED WORK 


XES Log Model. (Data) payloads are maps associating attributes
(i.e., keys) to data values. Given a finite set of activity labels Act,
an event $\sigma^i_j$ is a pair $\langle a, p \rangle$, where $a \in \mathrm{Act}$ is an activity label,
and $p$ is a payload, mapping each key to a single value. A trace
$\sigma^i$ is a temporally-ordered and finite sequence of distinct events
$\sigma^i = \sigma^i_1 \cdots \sigma^i_n$, modelling a process run. All events within the same
trace associate the same values to the same trace keys. A log $\mathcal{L}$
is a finite set of traces $\{ \sigma^1, \dots, \sigma^m \}$. We denote $\Sigma \subseteq \mathrm{Act}$ as the
set of all the distinct activity labels in the log. If a payload is also
associated to the whole trace, then this can be easily mimicked by
adding an extra event containing such a payload, __trace_payload,
at the beginning of the trace. This is evidenced by Table 2, where
the BPIC 2012 dataset contains by default 24 unique event labels,
but after injecting the __trace_payload event, this increases to 25.
This characterization is compliant with the eXtensible Event
Stream (XES) format, which is the de facto standard for event logs
[1].
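
The log model above can be sketched with a few plain data structures. This is only an illustration under our own naming assumptions (`Event`, `Trace`, `inject_trace_payload`, and the sample activity labels are hypothetical) and does not reflect how KnoBAB stores logs internally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set

Payload = Dict[str, Any]            # maps attribute keys to single values
TRACE_PAYLOAD = "__trace_payload"   # extra activity label carrying trace-level data

@dataclass
class Event:
    activity: str
    payload: Payload = field(default_factory=dict)

@dataclass
class Trace:
    events: List[Event]
    trace_payload: Payload = field(default_factory=dict)

def inject_trace_payload(trace: Trace) -> Trace:
    """Mimic a trace-level payload with an extra event at the start of the trace."""
    head = Event(TRACE_PAYLOAD, dict(trace.trace_payload))
    return Trace([head] + list(trace.events), {})

def activity_labels(log: List[Trace]) -> Set[str]:
    """The set of distinct activity labels occurring in the log."""
    return {event.activity for trace in log for event in trace.events}

# Hypothetical two-event trace with one trace-level attribute.
trace = Trace([Event("Register", {"amount": 10}), Event("Approve")],
              {"customer": "c1"})
log_before = [trace]
log_after = [inject_trace_payload(t) for t in log_before]
```

After the injection, the number of distinct activity labels grows by exactly one, which mirrors the increase from 24 to 25 labels reported for BPIC 2012.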


Conformance Checking. Temporal declarative languages pinpoint
recurring temporal patterns in highly variable scenarios so as to
describe them compactly for both machines and humans [19].
Every single temporal pattern is expressed through a template (i.e., an
abstract parameterized property: Table 1, column 2), which is then
instantiated on a set of real activation, target, or correlation
conditions. We can then categorize each Declare template from [14] by
means of these conditions and the ability to express correlations
between two temporally distant events happening in one trace: simple
templates (Table 1, rows 1-3) only involve activation conditions;
(mutual) correlation templates (rows 4 to 15) describe
a dependency between activation and target conditions, thus
including correlations between the two; and negative relation
templates (last 2 rows) describe a negative dependency between
two events in correlation. Although these templates may appear quite
similar, they generate completely different finite state machines,
thus suggesting that these conditions are not interchangeable.
Figure 2 exemplifies the behavioural difference between two clauses
differing only in the template of choice. As a semantics, Declare
adopts Linear Temporal Logic over finite traces (LTLf), which
interprets formulae over an unbounded, yet finite, linear sequence
of states. Given a trace $\sigma^i$, the evaluation of a formula $\varphi$ is done
in a given state (i.e., event id, or position) of the trace, and we use
the notation $\sigma^i_j \vDash \varphi$ to express that $\varphi$ holds starting from the $j$-th
event of the $i$-th trace. We also use $\sigma^i \vDash \varphi$ as a shortcut notation for
$\sigma^i_1 \vDash \varphi$. This denotes that the entire trace $\sigma^i$ satisfies $\varphi$. Given that
a Declare Model is composed of a set of clauses $\mathcal{M} = \{ c_l \}_{l \le n,\, n \in \mathbb{N}}$
which have to be simultaneously satisfied in order to be true, we
say that a trace $\sigma^i$ is conformant to a model if such a trace satisfies
Table 1: Declare templates illustrated as exemplifying clauses. $A \wedge p$ ($B \wedge q$) represents the activation (target) condition, $A$ ($B$)
denotes the activity label, and $p$ ($q$) is the data payload condition.


| Type | Exemplifying clause ($c_l$) | Natural Language Specification for Traces | LTLf Semantics ($[\![c_l]\!]$) |
|---|---|---|---|
| Simple | Init(A, p) | The trace should start with an activation. | $A \wedge p$ |
| Simple | Exists(A, p, n) | Activations should occur at least n times. | $\mathbf{F}(A \wedge p \wedge \mathbf{X}([\![\mathrm{Exists}(A, p, n-1)]\!]))$ |
| Simple | Absence(A, p, n+1) | Activations should occur at most n times. | $\neg[\![\mathrm{Exists}(A, p, n+1)]\!]$ |
| (Mutual) correlation | Precedence(A, p, B, q) | Events preceding the activations should not satisfy the target. | $\neg(B \wedge q)\ \mathbf{W}\ (A \wedge p)$ |
| (Mutual) correlation | ChainPrecedence(A, p, B, q) | The activation is immediately preceded by the target. | $\mathbf{G}(\mathbf{X}(A \wedge p) \Rightarrow (B \wedge q))$ |
| (Mutual) correlation | Choice(A, p, A′, p′) | One of the two activation conditions must appear. | $\mathbf{F}(A \wedge p) \vee \mathbf{F}(A' \wedge p')$ |
| (Mutual) correlation | Response(A, p, B, q) | The activation is either followed by or simultaneous to the target. | $\mathbf{G}((A \wedge p) \Rightarrow \mathbf{F}(B \wedge q))$ |
| (Mutual) correlation | ChainResponse(A, p, B, q) | The activation is immediately followed by the target. | $\mathbf{G}((A \wedge p) \Rightarrow \mathbf{X}(B \wedge q))$ |
| (Mutual) correlation | RespExistence(A, p, B, q) | The activation requires the existence of the target. | $\mathbf{F}(A \wedge p) \Rightarrow \mathbf{F}(B \wedge q)$ |
| (Mutual) correlation | ExlChoice(A, p, A′, p′) | Only one activation condition must happen. | $[\![\mathrm{Choice}(A, p, A', p')]\!] \wedge [\![\mathrm{NotCoExistence}(A, p, A', p')]\!]$ |
| (Mutual) correlation | CoExistence(A, p, B, q) | RespExistence, and vice versa. | $[\![\mathrm{RespExistence}(A, p, B, q)]\!] \wedge [\![\mathrm{RespExistence}(B, q, A, p)]\!]$ |
| (Mutual) correlation | Succession(A, p, B, q) | The target should only follow the activation. | $[\![\mathrm{Precedence}(A, p, B, q)]\!] \wedge [\![\mathrm{Response}(A, p, B, q)]\!]$ |
| (Mutual) correlation | ChainSuccession(A, p, B, q) | Activation immediately follows the target, and the target immediately precedes the activation. | $\mathbf{G}((A \wedge p) \Leftrightarrow \mathbf{X}(B \wedge q))$ |
| (Mutual) correlation | AltResponse(A, p, B, q) | If an activation occurs, no other activations must happen until the target occurs. | $\mathbf{G}((A \wedge p) \Rightarrow (\neg(A \wedge p)\ \mathbf{U}\ (B \wedge q)))$ |
| (Mutual) correlation | AltPrecedence(A, p, B, q) | Every activation must be preceded by a target, without any other activation in between. | $[\![\mathrm{Precedence}(A, p, B, q)]\!] \wedge \mathbf{G}((A \wedge p) \Rightarrow \mathbf{X}(\neg(A \wedge p)\ \mathbf{W}\ (B \wedge q)))$ |
| Negative relation | NotCoExistence(A, p, B, q) | The activation nand the target happen. | $\neg(\mathbf{F}(A \wedge p) \wedge \mathbf{F}(B \wedge q))$ |
| Negative relation | NotSuccession(A, p, B, q) | The activation requires that no target condition should follow. | $\mathbf{G}((A \wedge p) \Rightarrow \neg\mathbf{F}(B \wedge q))$ |


the LTLf semantics $[\![c_l]\!]$ associated to each clause⁴ $c_l$. Therefore,
the Maximum-Satisfiability problem (Max-SAT) for each trace
computes the ratio of the satisfied clauses to the whole model
size. An LTLf formula $\varphi$ is built by extending propositional logic
with temporal operators in bold:


$$\varphi ::= A \wedge p \mid \neg\varphi \mid \varphi \vee \varphi' \mid \varphi \wedge \varphi' \mid \mathbf{X}\varphi \mid \mathbf{G}\varphi \mid \mathbf{F}\varphi \mid \varphi\ \mathbf{U}\ \varphi'$$


where neXt ($\mathbf{X}\varphi$) denotes that the condition $\varphi$ should occur from
the next state, Globally ($\mathbf{G}\varphi$) denotes that the condition has to hold
on the entire subsequent path, Future ($\mathbf{F}\varphi$) denotes that the
condition should occur somewhere on the subsequent path, and Until
as $\varphi\ \mathbf{U}\ \varphi'$ denotes that $\varphi$ has to hold at least until $\varphi'$ becomes
true, either at the current or a future state. Generally, binary
operators bridge activation and target conditions appearing in two
distinct sub-formulae. Some operators can be seen as syntactic sugar:
WeakUntil is denoted as $\varphi\ \mathbf{W}\ \varphi' := (\varphi\ \mathbf{U}\ \varphi') \vee \mathbf{G}\varphi$, while the
implication can be rewritten as $\varphi \Rightarrow \varphi' := (\neg\varphi) \vee (\varphi \wedge \varphi')$. Similarly to
relational algebra, these operators also support equivalence rules,
thus allowing a given LTLf expression to be rewritten into an equivalent
one that might be more efficient to compute.
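
The semantics above can be turned into a small recursive evaluator over a single finite trace. The sketch below is a naive reference implementation written for illustration only: the tuple encoding of formulae, the function names, and the sample traces are our assumptions, positions are 0-based (unlike the 1-based event ids used in the text), and it is not the evaluation strategy adopted by KnoBAB.

```python
def atom(label, pred=lambda payload: True):
    """Atom A ∧ p: activity label plus a payload predicate."""
    return ("atom", label, pred)

def implies(a, b):
    # phi => phi' := (not phi) or (phi and phi'), as rewritten in the text
    return ("or", ("not", a), ("and", a, b))

def weak_until(a, b):
    # phi W phi' := (phi U phi') or G phi
    return ("or", ("U", a, b), ("G", a))

def holds(f, trace, j=0):
    """Does formula f hold from position j (0-based) of a non-empty trace?"""
    op, n = f[0], len(trace)
    if op == "atom":
        activity, payload = trace[j]
        return activity == f[1] and bool(f[2](payload))
    if op == "not":
        return not holds(f[1], trace, j)
    if op == "or":
        return holds(f[1], trace, j) or holds(f[2], trace, j)
    if op == "and":
        return holds(f[1], trace, j) and holds(f[2], trace, j)
    if op == "X":
        return j + 1 < n and holds(f[1], trace, j + 1)
    if op == "G":
        return all(holds(f[1], trace, k) for k in range(j, n))
    if op == "F":
        return any(holds(f[1], trace, k) for k in range(j, n))
    if op == "U":
        for k in range(j, n):
            if holds(f[2], trace, k):
                return True
            if not holds(f[1], trace, k):
                return False
        return False
    raise ValueError(f"unknown operator: {op}")

def max_sat_ratio(model, trace):
    """Ratio of satisfied clauses over the whole model size."""
    return sum(holds(clause, trace) for clause in model) / len(model)

# Hypothetical clauses and traces; events are (activity, payload) pairs.
high = lambda p: p["x"] > 50
low = lambda p: p["x"] < 23.5
init = atom("A", high)                                      # Init(A, p)
response = ("G", implies(atom("A", high), ("F", atom("B", low))))
ok_trace = [("A", {"x": 60}), ("C", {}), ("B", {"x": 20})]
bad_trace = [("A", {"x": 60}), ("B", {"x": 40})]
```

With the model `[init, response]`, `ok_trace` satisfies both clauses, whereas `bad_trace` only satisfies `init`, giving a Max-SAT ratio of 0.5. Note that each call re-scans the remainder of the trace, which is exactly the kind of repeated work a query plan aims to avoid.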

Although this formulation has already been extended to
support correlation constraints [8], such a solution is affected by the
following two deficiencies. First, correlation conditions have to be
represented alongside the target condition levels, thus hampering
the exploitation of efficient relational database algorithms for
correlation conditions via joins. Second, these operators can only
assess the validity of one trace at a time while, on the other hand,
we might need to assess the satisfiability of multiple traces at the
same time by composing partial results returned by every single
operator. These operators therefore cannot be directly exploited as query
operators, where multiple traces are considered simultaneously.
For these reasons, §3.2 proposes a reformulation of such operators.


⁴More formally, $\sigma^i \vDash \mathcal{M} \Leftrightarrow \forall c_l \in \mathcal{M}.\ \sigma^i \vDash [\![c_l]\!]$.

Figure 2: Traces describing the events generated by each
hospital unit: those are temporally ordered events associated
to activity labels (boxed). Activated (or targeted) events
are circled (or ticked/crossed). Ticks (or crosses) indicate a
(un)successful match of a target condition.


Data-Aware Conformance Checking. Declare Analyzer [8] proposes
one of the latest solutions for conformance checking over
data-aware logs. Declare templates are decomposed into LTLf
expressions (as per the last column of Table 1) that contain not only
event information, but also a payload associated to each event per clause.
Such a solution does not exploit the benefits of an RDBMS, where query
optimisations improve query running times. Hence, no performance
gain from shared sub-queries is considered so as to minimize
the data access, e.g., by conveniently structuring queries in a query
plan [4]. In addition, the authors scan all of the traces completely
for each Declare clause, while our proposed solution minimizes
the data access by only accessing the data relevant for running the
model-checking query. As their solution does not exploit multi-query
processing, sub-queries or entire clauses appearing
multiple times in the model might be recomputed multiple times,
thus hampering the overall running time. As for their
implementation of the LTLf operators, the authors do not exploit efficient
relational algebra operators when possible, such as full-outer-theta-joins
(or theta-joins) for unions (or conjunctions) with correlation
conditions. Last, each clause is completely hardcoded, as they do not
support novel templates via the definition of novel LTLf formulae,
as we instead do: the addition of further Declare clauses would
require an entirely new implementation. KnoBAB, on the other
hand, supports the definition of potential new Declare templates
via configuration files loaded at warm-up, thus enabling a more
general result that goes beyond the Declare language and that can
be applied to any temporal specification exploiting LTLf.

A more recent approach [3] defined specific data structures for
a limited support of declarative queries in sublinear time. Still, this
approach has the major shortcoming of pre-computing the possible
Precedence or Response queries at loading time. This approach
does not scale up to other possible declarative templates, as this
might require extending the data representation with additional
data structures. On the other hand, our proposed approach is query
independent and supports all of the possible queries that might
be expressed in xtLTLf. Furthermore, this approach supports logs
with neither trace nor event payloads, which prevents it from being
easily extended with activation, target, and correlation conditions
involving data predicates. As this approach had limited query
expressive power, it was not considered in our benchmarks.


Process Mining through Conformance Checking. Some approaches
utilise conformance checking as a mechanism to mine declarative
models from an event log: a scoring function tests the validity
of each possible clause over each possible trace. SQLMiner [24]
does so via SQL queries [23], where each specified declarative
template is converted into a SQL query. E.g., given the SQL
formulation for the Response template, the query returns a table
(Activation, Target, Score) where each row (a, b, s) represents a candidate
clause Response(a, true, b, true), and s is its score.

Each event log, as well as each activation and target activity
label for generating the candidate Declare constraints to be tested,
is stored in a distinct relational table. While the former is
represented in Log(Id, Trace, ActivityId, Event), the latter are stored in
Actions(ActivationId, TargetId). The authors consider Support and
Confidence scoring functions to determine the precision and
reliability of the calculation. Records which do not pass pre-determined
Support and Confidence thresholds are filtered out from the data.
While SQL also supports data constraints, this solution considers
Declare clauses with neither activation, nor target, nor correlation
conditions with payload predicates. This problem is also shared with
more recent approaches where, although the SQL syntax is extended, no
evidence of data predicates is given [20].
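
To illustrate what returning the satisfying traces, rather than a score, involves in plain SQL, the sketch below evaluates Response(a, true, b, true) per trace with SQLite. It does not reproduce the SQLMiner queries: the simplified table `log(trace_id, event_pos, activity)`, the function name, and the sample rows are assumptions made for this example only.

```python
import sqlite3

def traces_satisfying_response(rows, activation, target):
    """Return the ids of the traces where every activation is followed by
    (or simultaneous to) a target, using a nested NOT EXISTS sub-query."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE log (trace_id INTEGER, event_pos INTEGER, activity TEXT)")
    con.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    query = """
        SELECT DISTINCT trace_id FROM log
        EXCEPT
        SELECT a.trace_id FROM log AS a
        WHERE a.activity = ?
          AND NOT EXISTS (
              SELECT 1 FROM log AS b
              WHERE b.trace_id = a.trace_id
                AND b.activity = ?
                AND b.event_pos >= a.event_pos)
        ORDER BY 1
    """
    result = [row[0] for row in con.execute(query, (activation, target))]
    con.close()
    return result

# Hypothetical log: (trace id, event position, activity label).
ROWS = [
    (1, 1, "quote"), (1, 2, "order"), (1, 3, "delivery"),
    (2, 1, "order"), (2, 2, "quote"),
    (3, 1, "quote"),
]
satisfying = traces_satisfying_response(ROWS, "order", "delivery")
```

Trace 1 satisfies the clause, trace 2 violates it because its order is never followed by a delivery, and trace 3 satisfies it vacuously since no activation occurs. Even for this single template, the query needs a set difference and a correlated sub-query.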

Although the authors exploit data perspectives in ‘Resource
Assignment Constraints’ clauses, distinct from the Declare ones, only
trace payload conditions are considered. Instead, KnoBAB
supports payload information and predicate testing both per trace and
per event (see §2), which could also be stored in a separate table
as SQLMiner suggests, thus providing greater expressiveness per
clause. SQLMiner queries can be chained together using SET UNION,
though this provides no possibility for testing which clauses
are satisfied by the majority of the traces (Max-SAT). These
query plans are not optimized as in [4], thus failing at both
minimizing the data access and running multiple shared sub-queries only
once. This is inferior to KnoBAB, which has the ability to process
multiple declarative clauses from disparate templates.


3 LOGICAL MODEL 


3.1 (Intermediate) Result Representation 


Within the computation pipeline, (intermediate) results are
represented as a set of triplets $\langle i, j, L \rangle$ representing that, starting from
event $\sigma^i_j$ in trace $\sigma^i$, we might observe activation, target, or
correlation conditions in $L$, an ordered vector. While for activation
and target conditions we preserve the matched event id, correlations keep
track of both the activation and the target condition leading to the
satisfaction of a given Θ predicate (see the next section). This is a
sensible representation as, per declarative constraint, there may exist
only one possible Θ predicate. Such triplets are sorted by trace id
and event id, and the operators manipulating them (§3.2) guarantee
that only one triplet appears per unique trace and event id.
This guarantees efficient join operations across different
intermediate results, as well as efficient counting of the satisfied conditions
for each trace. E.g., the first clause from Figure 1 requires access to just
the AttributeTables, as all of the activity labels are associated to data
conditions. The offset from the attribute tables can then be used
to identify the trace and event associated to the data condition (if
fulfilled). When we want to return events for which P12 holds, we
need to only consider the data associated to Lumpectomy events
having a positive biopsy and levels of CA 15-3 greater than 50. This
will require the intersection of the events related to biopsy with the
ones related to CA 15-3. The selected rows are then converted into
the intermediate result representation and intersected; in this
situation, we only obtain $\{ \langle 3, 3, \{A(3)\} \rangle \}$, as the only event meeting
such requirements is the third from the third trace. As we are going
to see in the next paragraph, A is the container of matched
activation conditions. Similarly, Pg will return $\{ \langle 1, 3, \{A(3)\} \rangle \}$, thus
obtaining $\{ \langle 1, 3, \{A(3)\} \rangle, \langle 3, 3, \{A(3)\} \rangle \}$ as a final result associated
to the first clause: this remarks that only traces ① and ③ describe patients that
underwent a surgical operation under such conditions.
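
The benefit of keeping triplets sorted by trace id and event id can be sketched with a linear merge. The code below only illustrates the representation: the function names and the rows selected from the attribute tables are hypothetical, and the labels are plain strings such as "A(3)" rather than KnoBAB's internal structures.

```python
def intersect(left, right):
    """Sorted-merge intersection on (trace id, event id); label sets are united."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        a, b = left[i], right[j]
        if a[:2] == b[:2]:
            out.append((a[0], a[1], a[2] | b[2]))
            i += 1
            j += 1
        elif a[:2] < b[:2]:
            i += 1
        else:
            j += 1
    return out

def union(left, right):
    """Sorted-merge union on (trace id, event id), one triplet per key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        a, b = left[i], right[j]
        if a[:2] == b[:2]:
            out.append((a[0], a[1], a[2] | b[2]))
            i += 1
            j += 1
        elif a[:2] < b[:2]:
            out.append(a)
            i += 1
        else:
            out.append(b)
            j += 1
    return out + left[i:] + right[j:]

# Hypothetical selections, already sorted by (trace id, event id).
lumpectomy_biopsy_positive = [(3, 3, frozenset({"A(3)"}))]
lumpectomy_ca_above_50 = [(3, 3, frozenset({"A(3)"}))]
mastectomy_matches = [(1, 3, frozenset({"A(3)"}))]

lumpectomy_matches = intersect(lumpectomy_biopsy_positive, lumpectomy_ca_above_50)
clause_result = union(mastectomy_matches, lumpectomy_matches)
```

Both operations visit each input once and preserve the ordering, so their output can be fed directly to further operators, and the final result matches the one discussed above for the first and third trace.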

Our proposed representation is different from the one provided
by [8], which cannot represent, for each event within a trace, all
the possible activation, target, or join conditions happening in the
future, as it is impossible to represent single trace events that are
not necessarily represented by activation or target conditions. As
observed in §2, this information is required for checking the
satisfiability of $\varphi$ while jointly visiting both the trace (now represented
as subsequent rows in the result representation) and the formula.
In fact, the authors exploit a hash map of hash maps, associating each
trace to the collected activation conditions which, in turn, might
be associated with further target conditions. This solution is even
less efficient than exploiting sorted linear data structures.


3.2 eXTended LTLf operators
\[
\begin{array}{rl}
\varphi ::= & \mathsf{Init}_{A/T}(\mathsf{A},p) \mid \mathsf{End}_{A/T}(\mathsf{A},p) \mid \mathsf{Exists}_{A/T}(n,\mathsf{A},p) \mid \mathsf{Absence}_{A/T}(n,\mathsf{A},p) \\
\mid & \mathsf{Next}(\varphi) \mid \mathsf{Globally}(\varphi) \mid \mathsf{Future}(\varphi) \mid \mathsf{Not}(\varphi) \\
\mid & \mathsf{Or}(\varphi,\varphi',\Theta) \mid \mathsf{And}(\varphi,\varphi',\Theta) \mid \mathsf{Until}(\varphi,\varphi',\Theta) \\
\mid & \mathsf{AndGlobally}(\varphi,\varphi',\Theta) \mid \mathsf{AndFuture}(\varphi,\varphi',\Theta) \mid \mathsf{AndNextGlobally}(\varphi,\varphi',\Theta)
\end{array}
\]

Algorithm 1 xtLTLf pseudocode implementation for the basic timed operators

function Future(φ)
    for all (t, e, L) ∈ φ do yield (t, e, ∪{ L' | (t, e', L') ∈ φ and e' > e })
    end for

function Globally(φ)
    for all (t, e, L) ∈ φ do
        E ← { e' | (t, e', L') ∈ φ and e' > e }
        if |E| = ℓ_t − e then yield (t, e, ∪{ L' | (t, e', L') ∈ φ and e' ∈ E }) end if
    end for

function Next(φ)
    for all (t, e, L) ∈ φ s.t. e > 1 do yield (t, e − 1, L)
    end for

function CommonJoin(φ, φ', Θ, isDisjunctive)
    it ← Iterator(φ), it' ← Iterator(φ')
    while it ≠ ∅ and it' ≠ ∅ do
        (t, e, L) ← current(it), (t', e', L') ← current(it')
        if t = t' and e = e' then
            L'' ← ∅
            if Θ ≠ true and L ≠ ∅ and L' ≠ ∅ then
                for all A(m) ∈ L and T(n) ∈ L' s.t. Θ(m, n) do
                    L'' ← L'' ∪ { M(m, n) }
                end for
            else
                if L = ∅ then L'' ← { A(e) } else L'' ← L
                if L' = ∅ then L'' ← L'' ∪ { T(e') } else L'' ← L'' ∪ L'
            end if
            if L'' ≠ ∅ then yield (t, e, L'')
            next(it); next(it')
        else if t < t' or (t = t' and e < e') then
            if isDisjunctive then yield (t, e, L) end if
            next(it)
        else
            if isDisjunctive then yield (t', e', L') end if
            next(it')
        end if
    end while

function And(φ, φ', Θ) CommonJoin(φ, φ', Θ, false)
function Or(φ, φ', Θ) CommonJoin(φ, φ', Θ, true)

function Until(φ, φ', Θ)
    for all t s.t. (t, i', L') ∈ φ' do
        α ← 1; Map ← {}; i ← min_ι (t, ι, L) ∈ φ'; I ← max_ι (t, ι, L) ∈ φ'
        while i < I do
            if α = i then
                Map[α] ← Map[α] ∪ L'
                i ← min_{ι>i} (t, ι, L) ∈ φ'
            else if exists (t, j, L_j) ∈ φ s.t. j < i then
                if (t, α, L_α), (t, α+1, L_{α+1}), ..., (t, i−1, L_{i−1}) ∈ φ,
                   and Θ(i, j) for all T(j) ∈ L_α ∪ ⋯ ∪ L_{j−1} then
                    Map[α] ← Map[α] ∪ { M(k, i) | T(k) ∈ L_α ∪ ⋯ ∪ L_{j−1} }
                    i ← min_{ι>i} (t, ι, L) ∈ φ'
                else α ← α + 1
                end if
            else α ← i
            end if
        end while
        for all (i, L) ∈ Map do yield (t, i, L)
        end for
    end for


We extended the LTLf operators (xtLTLf), which are directly exploited by our
pipeline. Operators in the first line filter traces' events and represent
these in the previously described result representation. Init (End)
returns the events at the beginning (end) of each trace satisfying
the condition A ∧ p. Similarly to [8], each of these operators might
be expressed as either an untimed or a timed specification. Any
operator is considered timed by default when appearing inside
a timed operator, like Next, Globally, Future, Until, and any other
composed operator from the last line. E.g., in Figure 1, Exists(1, P3)
is a shorthand for Exists(1, FollowUp, −∞ < CA-15.3 < 23.5), as
each atom always associates an activity label with a payload
condition. The operator associated with Absence(1, P3) is untimed, while
the Exists(1, P3) descendant of Globally is timed. While the timed
definition returns a tuple (i, j, L) for each possible event σ^i_j within
the trace σ^i where the formula holds, the untimed specification
only checks whether the formula holds at the beginning of the
trace. E.g., untimed Exists (Absence) returns the first event of the trace if
at least n (at most n−1) events satisfy A ∧ p, while the timed version
returns the events satisfying (not satisfying) A ∧ p (always n = 1).
All of these operators might be optionally marked as returning
either an activation (A) or a target (T) condition, so that each (i, j, L)
triplet has L = {A(j)} or L = {T(j)}; when no mark is specified, L
is empty. To wrap up the previous example, the timed Exists(1, P3)
will list the events where P3 happened, { (1, 3, {A(3)}) }, while the
untimed version will just list the traces where such an event happened
and collect the events of interest in L.

The next two lines report the same operators described in §2,
with the addition of explicit correlation conditions over activation
and target conditions for each binary operator. Algorithm 1
provides implementations of the timed versions of such operators;
due to lack of space, the untimed versions are not provided, yet they
are available in our codebase. Please observe that Next(φ) keeps unaltered
the activation and target conditions from φ and just returns the
events where φ happens as a subsequent step. Any binary operator
supports Θ conditions: And (and Or) can be expressed as a
(full-outer-)Θ-join algorithm over the activation and target conditions
stored in L associated with the same event. If at least one activation
condition matches one target condition from the same event,
those are expressed as a marked correlation condition M(i, j), which
is then returned by the join. Regarding the same Choice clause
from Figure 1, the correlation condition Θ associated with Or is then
computed for each activation/target match, and if the condition is
satisfied, the resulting match is added to L.
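
The behaviour of Next and of the Θ-join shared by And and Or can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: marks are encoded as tuples ("A", m), ("T", n) and ("M", m, n), passing `theta=None` stands for Θ = true, and the final flush of the unconsumed input in the disjunctive case is our own addition to obtain a full outer join.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

Mark = Tuple  # ("A", m), ("T", n) or ("M", m, n): a hypothetical encoding
Triplet = Tuple[int, int, FrozenSet[Mark]]
Theta = Callable[[int, int], bool]


def next_op(phi: List[Triplet]) -> List[Triplet]:
    """Timed Next: shift each event one step back, leaving L unaltered."""
    return [(t, e - 1, marks) for (t, e, marks) in phi if e > 1]


def common_join(phi: List[Triplet], phi2: List[Triplet],
                theta: Optional[Theta], disjunctive: bool) -> List[Triplet]:
    """Merge two results sorted by (trace id, event id)."""
    out: List[Triplet] = []
    i = j = 0
    while i < len(phi) and j < len(phi2):
        t, e, left = phi[i]
        t2, e2, right = phi2[j]
        if (t, e) == (t2, e2):
            if theta is not None and left and right:
                # Keep only the activation/target pairs passing the correlation.
                acts = [m[1] for m in left if m[0] == "A"]
                tgts = [m[1] for m in right if m[0] == "T"]
                merged = frozenset(("M", a, b) for a in acts for b in tgts
                                   if theta(a, b))
            else:
                merged = (left or frozenset({("A", e)})) | \
                         (right or frozenset({("T", e2)}))
            if merged:
                out.append((t, e, merged))
            i += 1
            j += 1
        elif (t, e) < (t2, e2):
            if disjunctive:
                out.append(phi[i])
            i += 1
        else:
            if disjunctive:
                out.append(phi2[j])
            j += 1
    if disjunctive:  # assumption: unmatched leftovers belong to a full outer join
        out.extend(phi[i:])
        out.extend(phi2[j:])
    return out


# Hypothetical inputs, not taken from Figure 1.
a2, t2_, t1_ = frozenset({("A", 2)}), frozenset({("T", 2)}), frozenset({("T", 1)})
left_in = [(1, 2, a2), (3, 2, a2)]
right_in = [(1, 2, t2_), (2, 1, t1_)]
and_result = common_join(left_in, right_in, lambda m, n: m == n, False)
or_result = common_join(left_in, right_in, None, True)
```

Because both inputs are sorted and the output is emitted in the same order, the result can be fed directly into another operator, which is what makes the operators composable.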

The remaining operators merge multiple operators together
when a specific implementation outperforms the execution of
the operators separately: e.g., AndFuture(φ, φ', Θ) is equivalent
to And(φ, Future(φ'), Θ), but preliminary experiments reveal that
the former has a more efficient implementation than computing
the latter. This choice was inspired by relational algebra, where
θ-joins are usually more efficient than performing a join and a
selection operation separately. On the other hand, Implies(φ, φ', Θ)
is rewritten as Or(Not(φ), And(φ, φ', Θ), true). As per the previous
discussion, the left leaf of AndFuture_Θ in Figure 1 returns all
of the referral events with CA 15-3 above the safeguard levels,
{ (1, 2, {A(2)}), (3, 2, {A(2)}) }, while the right leaf returns just the
follow-up events below such levels, { (1, 4, {A(4)}) }. The operator
AndFuture_Θ will then return only { (1, 2, {M(2, 4)}) }, as only the
first trace has a decrease below the safeguard levels from
referral to follow-up. Each xtLTLf operator returns and, where
applicable, accepts data in the result representation, thus making such
operators closed under this format.


4 KNOBAB ARCHITECTURE 


The methodology behind KnoBAB's design systematically follows the major
architectural components of a relational database, with the only
bespoke characteristic being the tailoring of such a solution to the specific
problem that we intend to solve (§4.3), that is, computing either the
Max-SAT for each log trace, or the Confidence/Support associated
with each model clause, or the traces satisfying all of the
model clauses (conjunctive query).


4.1 Data Loading 


The data loading phase loads logs serialized in multiple formats,
including the XML-based XES standard, a tab-separated list of events'
activity labels, and the Human Readable Log Format (HRLF)
first introduced in [7]. We use different data parsers, which are
all linked to the same data loading primitives. HRLF also supports
the bool data type. This is represented as an integer, where
val < 1.0 denotes false and val ≥ 1.0 denotes true. In Figure 1, booleans are
displayed in their traditional way (both in the payload and for
activation/target conditions), though this is for visual purposes
only. E.g., Exists(1, Referral, biopsy = true) in our pipeline is
Exists(1, Referral, biopsy ≥ 1.0).

If the log does not contain data payloads, the entire log can be
represented by two relational tables, CountingTable(ActivityId, Trace,
Count) and ActivityTable(ActivityId, Trace, Event, Prev, Next). While
the former counts the occurrences of each activity label in Σ for each
trace, the latter lists all of the possible events similarly to SQLMiner.
Both tables compactly represent the initial three columns as a 64-bit
unsigned integer, which is also used to sort the tables in ascending
order. A row (a, j, h) from CountingTable states that there are
h events exhibiting the activity label a in the trace σ^j; each row
(a, j, i, q, q') from ActivityTable states that the i-th event of the j-th
trace (σ^j_i = (a, p)) is labelled as a, while q (or q') is the pointer to the
immediately preceding, σ^j_{i−1} (or following, σ^j_{i+1}), event within the
trace, if any. NULLs from Figure 1 in ActivityTable highlight the start
(finish) event of each trace, where there is no possible reference
to past (future) events. Trace payload information is injected (as
an event) before the first event, which is also contained: all trace
payload events contain NULL as Prev.
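
Packing the first three columns into one 64-bit unsigned integer can be sketched as follows. The text does not state how the 64 bits are split among the columns, so the 16/32/16 split, the constant names and the helper functions below are purely hypothetical assumptions made for illustration.

```python
# Hypothetical bit widths (activity, trace, event or count), summing to 64.
ACT_BITS, TRACE_BITS, EVENT_BITS = 16, 32, 16


def pack_key(activity: int, trace: int, event: int) -> int:
    """Pack three non-negative columns into a single 64-bit sort key."""
    if not (0 <= activity < 1 << ACT_BITS
            and 0 <= trace < 1 << TRACE_BITS
            and 0 <= event < 1 << EVENT_BITS):
        raise ValueError("component out of range for the assumed bit widths")
    return (activity << (TRACE_BITS + EVENT_BITS)) | (trace << EVENT_BITS) | event


def unpack_key(key: int):
    """Recover (activity, trace, event) from a packed key."""
    event = key & ((1 << EVENT_BITS) - 1)
    trace = (key >> EVENT_BITS) & ((1 << TRACE_BITS) - 1)
    activity = key >> (TRACE_BITS + EVENT_BITS)
    return activity, trace, event


# Sorting by the packed key matches the lexicographic order of the columns.
rows = [(2, 1, 3), (0, 3, 1), (2, 1, 1), (0, 1, 2)]
by_key = sorted(rows, key=lambda r: pack_key(*r))
```

Placing the activity label in the most significant bits is what lets a single integer comparison sort the table by activity first, then by trace, then by the third column.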

If, on the other hand, the log is associated with either trace or event
payloads, we exploit the query- and memory-efficient column-based
model [11], thus representing all of the values v associated with a
payload key k within the rows of AttributeTable_k. In our implementation,
each row (a, v, i) from AttributeTable_k(ActivityId, Value, Offset)
represents a value v associated with the key k, where i determines the
location of the event containing the accessed value in
ActivityTable; this provides the trace id and event id required for the
intermediate representation. To perform payload-based queries
efficiently, the table is sorted in ascending order by the three columns.
As each data condition is always associated with a given activity
label, such conditions can be evaluated as range queries via
binary search. In Figure 1, all the attributes are stored
in distinct tables. Value can contain multiple data types, but each
attribute is associated with only one type.
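
A range query over a sorted attribute table can be sketched with Python's standard `bisect` module. The table content and the function name are hypothetical; the sketch only assumes rows of the form (activity id, value, offset) sorted in ascending order and non-negative offsets.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Row = Tuple[int, float, int]  # (activity id, value, offset into ActivityTable)


def range_query(table: List[Row], activity: int, lo: float, hi: float) -> List[int]:
    """Offsets of the rows of `activity` whose value lies in [lo, hi]."""
    # Offsets are assumed non-negative, so -1 and +inf act as sentinels
    # placed before and after every row sharing the same (activity, value).
    start = bisect_left(table, (activity, lo, -1))
    end = bisect_right(table, (activity, hi, float("inf")))
    return [offset for (_a, _v, offset) in table[start:end]]


# Hypothetical attribute table for one payload key, sorted by the three columns.
table = sorted([(0, 12.0, 5), (0, 60.0, 2), (1, 20.0, 7), (1, 55.0, 1), (1, 80.0, 9)])
offsets = range_query(table, 1, 50.0, 100.0)
```

Each query costs two binary searches plus the size of the answer, and the returned offsets are then resolved against ActivityTable to obtain the trace id and event id.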

CountingTable is mainly accessed for the existential templates (Exists and
Absence) where no data payload is specified, while
ActivityTable is used for either returning all of the events within the
log associated with a given activity label or returning all of the events
happening at either the beginning or the end of a trace. Each
table AttributeTable_k, on the other hand, returns all the events
satisfying a given condition associated with a specific key k.

After loading the whole dataset, the number of traces within
the log |L|, the length ℓ_j of each trace σ^j, and the number of
distinct activity labels |Σ| are known. Given this, we can get the
number of occurrences of the i-th activity label from Σ in each
trace by directly accessing the rows of the CountingTable in
Figure 1 within the range [|L|·(i−1)+1, |L|·i]. E.g., the offsets
for accessing the Mastectomy activity label in CountingTable are
[3·(4−1)+1, 3·4] = [10, 12]. Given that this counting table
is used only for untimed operations, the intermediate result
for untimed Exists_A(1, Mastectomy, true) is { (1, 1, ∅) }, as only trace 1
contains such an event. On the other hand, the loading and indexing
phase generates an ActivityTable associated with two indices, a
primary and a secondary index. While the former returns all of the
events associated with a specific activity label, the latter accesses
either the first or the last event in a trace. Pointers associated with
each record enable the temporal scan of traces.
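
The offset arithmetic for CountingTable can be checked with a few lines of Python. The function name is hypothetical; the formula is the one given above, with 1-based activity indices and 1-based, inclusive row offsets.

```python
from typing import Tuple


def counting_table_range(num_traces: int, activity_index: int) -> Tuple[int, int]:
    """Inclusive, 1-based row range holding the counts of the i-th activity label."""
    first = num_traces * (activity_index - 1) + 1
    last = num_traces * activity_index
    return first, last


# Mastectomy is the 4th activity label in a log with 3 traces.
mastectomy_rows = counting_table_range(3, 4)
```

Since every activity label owns exactly |L| consecutive rows, one per trace, the lookup needs no search at all.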


4.2 Query Compiler 


The query compiler is structured into three main phases. (i) The
atomization pipeline rewrites the data predicates associated with
each activity label as a disjunction of mutually exclusive data
conditions. We can tune KnoBAB to always atomize each possible
activity label if there exists any Declare constraint associating it with a
data condition as in [7], or we can choose to provide such an interval
decomposition only to the Declare constraints exhibiting data
conditions. While the former approach will maximise the access
to the AttributeTables, the latter will maximise the access to the
ActivityTable. By doing so, we can ensure that the data satisfying some
given properties is visited at most once, thus guaranteeing the
assumptions from [4] also at the data access level. Correlation
conditions do not undergo this rewriting step. The atomized model
in Figure 1 replaces the non-correlation data predicates with the
outcome of the atomization process as in [7].

(ii) We rewrite each Declare constraint as an xtLTLf formula,
where the activation (and the potential target) conditions are
instantiated with either just activity labels or also with associated
data conditions as per the previous atomization step. Each
sub-expression appears at most once as in [4], by representing every
single node in the query plan at most once: this is ensured by an
internal query manager cache. The resulting query plan, considering
the simultaneous execution of multiple queries, can be represented
as a Directed Acyclic Graph (DAG). For each declarative clause
appearing more than once (e.g., m > 1 times), the associated xtLTLf
expression is computed at most once, while its resulting data is
accessed m times by the final aggregator: as per Figure 1,
although Response might be considered a subquery of Succession,
the Max-SAT is still going to retrieve the output provided by the
associated sub-expression. Green arrows highlight operators' output
shared among operators. Please also observe that operators
with the same name and arguments but marked either with
activation, target, or no specification are considered different, as they
provide different results, and therefore are not merged together.
This includes distinctions between timed and untimed operators.

Given that our execution engine provides the possibility of running
a query plan in either a parallel or a sequential mode, we
need an additional step. (iii) The previous DAG represents a
dependency graph, where a link between an ancestor and one of its
descendants implies that the latter has to be computed before the
former, thus suggesting an execution order. Figure 1 depicts this as
an arrow starting from the ancestor. To enforce that, we compute a
lexicographical order over the DAG, through which we obtain
the maximum depth level associated with each node of the graph.
We then represent the query graph as a stack of depth levels, where
each operator can be run in parallel alongside its siblings. This
shows that the computation of Declare clauses can be reduced to
an embarrassingly parallel problem, as the layered execution
guarantees that no thread communication needs to happen, and that
multiple threads can concurrently access the partial results
associated with the immediately-descendant operators, as the former
will return all of the events where the condition happened, while
the latter will just return the trace event satisfying such a condition
alongside the required activations/targets listed in L. Furthermore,
the proposed parallelization minimizes the data access for
computing the query. The DAG in Figure 1 depicts such a query plan.
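
The layering step can be sketched as follows: each operator receives the maximum depth at which it is reachable from a root, and the layers are then executed from the deepest to the shallowest. This Python fragment is an illustrative assumption rather than the actual compiler code; the function name, the dictionary-based plan encoding and the operator names are hypothetical, and the simple recursive visit is written for clarity rather than efficiency.

```python
from collections import defaultdict
from typing import Dict, List


def depth_layers(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group operators by maximum depth; the deepest layer comes first.

    `deps` maps each operator to the operators it consumes (its descendants).
    """
    depth: Dict[str, int] = {}

    def visit(node: str, level: int) -> None:
        if depth.get(node, -1) >= level:
            return  # already placed at this depth or deeper
        depth[node] = level
        for child in deps.get(node, []):
            visit(child, level + 1)

    consumed = {c for children in deps.values() for c in children}
    for node in deps:
        if node not in consumed:  # roots: one per declarative clause
            visit(node, 0)

    layers = defaultdict(list)
    for node, level in depth.items():
        layers[level].append(node)
    return [sorted(layers[level]) for level in sorted(layers, reverse=True)]


# Hypothetical plan: Exists_P1 is shared by Not and And, so it is computed once.
plan = {
    "Or": ["Not", "And"],
    "Not": ["Exists_P1"],
    "And": ["Exists_P1", "Future"],
    "Future": ["Exists_P2"],
}
layers = depth_layers(plan)
```

Because every descendant ends up strictly deeper than each of its ancestors, all operators within one layer depend only on layers that were already executed and can therefore run in parallel.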


4.3 Execution Engine 


At the time of writing, KnoBAB supports four different types
of model aggregation queries: Conjunctive Query, Max-SAT,
Confidence, and Support. As we will see at the end of the subsection,
these do not require a change to the query plan, but just a different
way to integrate the intermediate representation φ_k returned
by each declarative clause c_k.

First, the execution engine takes both the relational database
resulting from the data loading and the DAG returned by the query
compiler, and uses the leaf nodes from the latter to access the former.
By query plan construction, all of the relevant data parts are
accessed at most once and then transformed into the expected
intermediate result representation. Second, the intermediate results
are propagated from the leaves towards each root node associated
with a declarative clause c_k. Any intermediate representation is
always associated with the operator returning it as a temporary
primary-memory cache. Each intermediate cache might be completely
freed if we are not computing a Confidence query and
the furthest ancestor has already accessed it, or if the cache is
not associated with an activation required by Confidence and the
furthest ancestor has already accessed it. Third, when the computation
finishes running the shallowest DAG depth level, which contains
the xtLTLf root associated with the entry point of each declarative
clause c_k, each of these operators will have an intermediate result
φ_k stating all the traces satisfying c_k.

Competitor        Dataset               Traces |L|  Events  Distinct Activities |Σ|
SQL Miner         BPIC 2011 (original)  1143        150291  624
SQL Miner         BPIC 2011 (10)        10          2613    158
SQL Miner         BPIC 2011 (100)       100         12195   276
SQL Miner         BPIC 2011 (1000)      1000        133935  607
Declare Analyzer  BPIC 2012 (original)  13087       262200  24

Table 2: Range of datasets used for benchmarking.

The Conjunctive Query returns the traces satisfying all of
the Declare clauses by intersecting the results of all the clauses via
And, with true as the Θ condition. Max-SAT counts, for each log
trace σ^i, the intermediate results φ_k associated with each clause c_k
containing it, and then provides the ratio of such a value over the total
number of model clauses |M|. By denoting as ActLeaves(φ_k) the
untimed union of the intermediate results returned by the activation
conditions for the Declare clause c_k ∈ M, the Confidence for c_k is
the ratio between the total number of traces returned by φ_k and the
overall traces containing an activation condition. Dividing the total
number of traces returned by φ_k by the total log traces returns the
Support. Once each φ_k per clause c_k is computed, the aggregation
functions can then be expressed as follows:


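\begin{align*}
\mathsf{ConjQuery}(\varphi_1,\dots,\varphi_n) &= \mathsf{And}(\varphi_1,\dots\mathsf{And}(\varphi_{n-1},\varphi_n,\mathbf{true})\dots,\mathbf{true})\\
\textsf{Max-SAT}(\varphi_1,\dots,\varphi_n) &= \left(\frac{|\{\,k \mid \exists j,L.\ (i,j,L)\in\varphi_k\,\}|}{|\mathcal{M}|}\right)_{\sigma^i\in\mathcal{L}}\\
\mathsf{Confidence}(\varphi_1,\dots,\varphi_n) &= \left(\frac{|\{\,i \mid \exists j,L.\ (i,j,L)\in\varphi_k\,\}|}{|\mathsf{ActLeaves}(\varphi_k)|}\right)_{c_k\in\mathcal{M}}\\
\mathsf{Support}(\varphi_1,\dots,\varphi_n) &= \left(\frac{|\{\,i \mid \exists j,L.\ (i,j,L)\in\varphi_k\,\}|}{|\mathcal{L}|}\right)_{c_k\in\mathcal{M}}
\end{align*}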

As the user in Figure 1 asks the ratio between satisfied clauses over 
the model size, the query plan exhibits a Max-SAT aggregation. 
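
The three ratio-based aggregations can be sketched in Python over the triplet representation. The function names, the encoding of ActLeaves as a set of trace ids per clause, and the choice of returning 0.0 when a clause has no activated trace are our own assumptions; the inputs are made up for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Triplet = Tuple[int, int, FrozenSet[str]]  # (trace id, event id, L)


def traces_of(phi: List[Triplet]) -> Set[int]:
    """Trace ids appearing in an intermediate result."""
    return {t for (t, _e, _marks) in phi}


def max_sat(results: List[List[Triplet]], log_traces: List[int]) -> Dict[int, float]:
    """Per trace: fraction of model clauses whose result contains that trace."""
    satisfied = [traces_of(phi) for phi in results]
    return {t: sum(t in s for s in satisfied) / len(results) for t in log_traces}


def support(results: List[List[Triplet]], log_traces: List[int]) -> List[float]:
    """Per clause: satisfying traces over all the log traces."""
    return [len(traces_of(phi)) / len(log_traces) for phi in results]


def confidence(results: List[List[Triplet]], act_leaves: List[Set[int]]) -> List[float]:
    """Per clause: satisfying traces over the traces containing an activation."""
    # Assumption: a clause that is never activated gets confidence 0.0.
    return [len(traces_of(phi)) / len(act) if act else 0.0
            for phi, act in zip(results, act_leaves)]


# Hypothetical results for a model with two clauses over a log with three traces.
empty = frozenset()
phi_1 = [(1, 2, empty)]
phi_2 = [(1, 1, empty), (3, 1, empty)]
log_traces = [1, 2, 3]
ms = max_sat([phi_1, phi_2], log_traces)
sup = support([phi_1, phi_2], log_traces)
conf_ = confidence([phi_1, phi_2], [{1, 3}, {1, 2, 3}])
```

All three functions consume the same per-clause results, which illustrates why switching aggregation does not require changing the query plan.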


5 EXPERIMENTAL ANALYSIS 


Our benchmarks were run on a Razer Blade Pro with Ubuntu 20.04: Intel
Core i7-10875H CPU @ 2.30GHz - 5.10 GHz, 16GB DDR4 2933MHz
RAM, 180GB free disk space. Our datasets (Table 2) include two
real-life event logs: BPIC 2011 (Dutch academic hospital log) and BPIC
2012 (Dutch loan company).


5.1 SQLMiner 


These experiments test our working hypothesis that it is
possible to engineer a tailored relational database architecture
that can outperform process mining through conformance checking
running on traditional relational databases. In the latter, no
LTLf operators are exploited, but a table similar to ActivityTable
is used. Given that the SQL provided in [23, 24] might only
return the Support associated with each candidate Declare clause
(SQLMiner+Support), we made the smallest possible changes to
also associate each candidate clause with the set of all the traces
satisfying it. This was achieved by both extending the activation
condition expressed in SQL and using array_agg, included in
PostgreSQL 14.2, to list such traces (SQLMiner+TraceInfo). To
compare the same settings in KnoBAB, we ran both Max-SAT
and Support queries, with the difference that, in our case, both of
these implementations always return, as per the intermediate result
specification, the trace information satisfying each possible model
clause. For our experiments, we exploited the BPIC 2011 dataset from
[24]. To test the scalability of the solutions, we recorded the query
runtime with increasing log size: we randomly sampled the log
into three sub-logs containing 10, 100 and 1000 traces, while
guaranteeing that each sub-log is always a subset of the greater ones.
For each sub-log, we generated 8 distinct models as benchmarked
in [23]. Each model consists of 25 clauses instantiating the same
Declare template (elected template) with different activation and
target conditions. Those did not consider payload conditions and
only considered the most frequent activity labels appearing in the
sub-log. Models of greater size caused an exponential increase in the
secondary memory required by SQLMiner (on the order of TB),
justifying our approach of a sampled model. In their approach, each
model was queried by running the SQL query corresponding to the
elected template, and the specific activation and target conditions
from the model's clauses were distinct rows in the Action table.

[Figure 3: KnoBAB vs SQLMiner Performance for 25 clauses
with frequent activity labels with Support and Trace Information.
Panels: Alternate Precedence, Alternate Response, Chain Precedence,
Chain Response, Not Succession, Precedence, Responded Existence,
Response; x-axis: Log Size (10, 100, 1000); series: KnoBAB + Max-SAT,
KnoBAB + Support, SQLMiner + Support, SQLMiner + Trace Info.
OOM indicates Out of Secondary Memory for logs
containing 10^3 traces (followed by the time taken for an
exception to occur); the top-row panels report OOM (1.56e+06),
OOM (1.25e+06), OOM (1.58e+06) and OOM (1.66e+06).]

The outcome of such experiments is represented in Figure 3,
where each plot represents the running time associated with models
containing the same elected template. In the worst-case scenario
(Response), we exhibit query running times similar to SQLMiner.
Even so, we always provide trace information, and in the case
of Response, altering the SQL to provide this causes over an order
of magnitude increase in running time. In the best-case scenario, we
outperform SQLMiner by up to 5 orders of magnitude. This is
because our query plan minimizes the access to the data
and our computation avoids the explicit computation of aggregations.
This was achieved by sorting the intermediate results and,
as our operators' implementations guarantee that (intermediate)
results are always sorted, counting operations are just linear scans
of the intermediate result representation. Our solution never
exceeded the 16GB of primary memory while, for some more complex
queries (top row of Figure 3), SQLMiner exceeded it, thus showing
that our solution is also memory efficient. One of the outstanding
examples is Responded Existence, where we are far more efficient
than SQLMiner. This is a clear indicator of the potential gains from
utilising our proposed CountingTable, which summarizes the appearance of
activity labels in events per trace by counting their instances. The
original SQL query is required to scan the whole Log table (similar
to our ActivityTable), which contained all of the trace events. We
also remind the reader that the CountingTable can be efficiently created
while scanning the whole log dataset, so no super-linear overhead
is added at loading time. This further validates that an adequate
tabular representation, paired with xtLTLf operators extending
the LTLf specification for tabular data, provides a suitable solution.
Last, the Max-SAT and Support queries
for KnoBAB exhibited similar running times, while in PostgreSQL
those exhibit huge variations depending on the query-plan rewriting
performed by the PostgreSQL query engine. For some elected
templates, our proposed SQLMiner+TraceInfo formulation also proved
to be more efficient than the SQLMiner+Support queries originally
proposed by [23] (which contain no trace information).

Figure 4: KnoBAB vs Declare Analyzer Performance.


5.2 Declare Analyzer 


The experiments on Declare Analyzer aim at comparing
our proposed solution against a solution tailored for solving Declare
Conjunctive Queries over logs running exclusively in primary
memory. We chose to exploit MapDB for its log representation,
thus making it more similar to a relational database. We exploited
the BPIC 2012 dataset, described in Table 2 and also used in [8]. The
data was modified so as to efficiently act across the trace payload
information: [8] requires injecting the trace payload information
into each event, while our implementation, as stated in §4.1, injects the trace
payload as a unique event at the beginning of the trace. The queries
(from the same paper) were edited so that, for all models M_i and
M_j with i < j, the former is always a subset of the latter, while
M_{i+1} is larger than M_i by 5 clauses. Our experiments indicate that,
overall, we are 2-3 orders of magnitude more performant than
Declare Analyzer. The conjunctive query, denoted as KnoBAB+CQ,
demonstrates greater performance than KnoBAB+Support, as the
calculations required for the support values per clause are more
costly for smaller models, though this difference is only within the order of
milliseconds. For an increase in model size, Declare Analyzer
has a much greater time increase than KnoBAB (the best case for
Declare Analyzer is over an order of magnitude greater than that
of KnoBAB). While the linear interpolation of Declare Analyzer 
provides a slope of 3.47 - 10* ms per model size, KnoBAB provides 
a slope of 10!, thus providing an inferior overall growth rate. To 
explain the abrupt time increase from M_1 to M_2, we encourage
the reader to refer back to the query plan from Figure 1. With each
increase in model size, entirely new event labels and data payloads
are considered (albeit the conditions within each sub-model are
similar). As KnoBAB thrives when data access can be limited, the
addition of new data requires more decomposition within the
atomization pipeline and, as more atoms are now considered, querying
also suffers, as more data is going to be accessed. As a result,
the complexity increase is worse than in examples tailored to benefit
from limited data access, as in the previous scenarios where queries
shared multiple and frequent activity conditions. Still, Declare
Analyzer always completely scans all the events by design,
whereas, for some queries, we might avoid scanning irrelevant
trace events.


6 CONCLUSIONS AND FUTURE WORKS 


We propose KnoBAB, a fully relational database architecture for
computing Conformance Checking via conjunctive queries, as well
as Max-SAT and clause Confidence/Support functions. KnoBAB
consists of a data loader and indexer, a query compiler, and an execution
engine, thus fully matching the architecture of a relational database.
This solution was enabled by the extension of the traditional LTLf
operators, providing algebraic semantics to declarative temporal
models, so as to support data operations over tuples representing
trace events. Our solution is not limited to one single declarative
language of choice, as it might support any possible model that can
be expressed via xtLTLf operators. Based on the latest solutions
in the current database literature, the query plan was also designed
to minimize the data access by running the common sub-queries
at most once. KnoBAB outperforms state-of-the-art solutions, whether
tailored to the specific dataset or based on traditional relational
databases running SQL queries. This solution will enable us to learn
models by exploiting abductive reasoning rather than traditional
mining techniques, thus also providing safety guarantees over noisy
data and models that are inconsistency-free [18].

Future work will provide extensive benchmarks for bigger log
datasets and speed-up results for the parallelized execution
of the resulting query plan: although this is already implemented,
we postpone those results due to the lack of space in
the present paper. For the time being, the logs available from the
research community are quite compact, and therefore the whole
dataset fits well in primary memory. Dealing with actual big data
solutions or bigger models will require us to migrate the data store
to secondary memory, thus requiring the adoption of Near-Data
Processing techniques [5]. As part of the data-loading phase,
Human Readable Log Format key names currently only support
strings consisting of letters. A proposed extension would allow for
any possible string name, including numbers and symbols.

The adoption of relational databases and operator-based query
plans might enable incremental trace updates, so as to extend those
at runtime: this open research problem can now be solved by
exploiting algebraic rewriting rules similar to the ones from relational
databases, thus requiring a formal definition of xtLTLf operators.

ABSTRACT


The rapid development of autonomous systems and their presence 
in human life force them to make quick decisions based on Big 
Data. Many of these decisions involve moral judgments that are 
then transformed into specific actions. Side effects of the choices 
made by the decision systems can be dangerous, so we have to be 
very careful when increasing the capacity of these systems. The 
decisions the autonomous systems make should be as ethical as 
possible. 

This paper adapts the observe-orient-decide-act (OODA) loop 
to the decision-making process in the moral area. It combines the 
parameterization of cognitive aspects of autonomous systems with 
ethical standards and moral inference. Problems related to the im- 
plementation of moral inference to autonomous systems, including 
artificial intelligence (AI) systems, are presented. Thanks to the 
adaptation of the OODA loop, it is possible to make morally correct 
decisions and actions based on a set of ethical principles adjusted 
to a specific situation. 

The presented proposal allows for moral inference, which ex- 
tends the possibilities of autonomous systems that use the inference 
loop, especially those processing Big Data. The decision-making 
system still has the possibility of a choice aimed at doing more good 
or less evil. 


CCS CONCEPTS 


• Applied computing → Multi-criterion optimization and
decision-making; • Information systems → Decision support
systems; • Computing methodologies → Knowledge representation
and reasoning; • Computer systems organization → Robotic
autonomy.

1 INTRODUCTION


Nowadays, man and autonomous decision-making systems often 
follow the same paths. Intelligent machines can be found in many 
areas of human life: in medicine, business, and everyday life, e.g. 
autonomous vehicles. In many cases, these devices make the de- 
cisions. Some of them have moral consequences. In such cases, 
the decision-making system should conclude before taking action 
whether the decision and its implications are ethical. 

Conducting a moral discourse in the area of information systems 
requires the analysis of several topics [17], including current ethical 
standards, intelligence, and morality, applied to machines, as well 
as the personality of AI systems, and ultimately understanding how 
self-learning decision systems process collected Big Data. 

This paper contributes to the area of the decision-making pro- 
cess of autonomous systems using Big Data. It indicates problems 
related to the implementation of moral inference elements in AI
systems. As one of the solutions, it proposes to adapt the OODA 
loop as a cycle that allows for simple moral reflection in a specific 
situation. The OODA loop is currently used in Big Data analysis to 
boost the outcomes [21], as well as to accelerate decision-making 
processes [7]. The presented considerations are also intended to 
encourage researchers to adapt and implement the elements of 
morality in the reasoning algorithms applied in the autonomous 
devices. 

The paper is organized as follows. Section 2 presents the OODA 
loop and the main problems of its usage in autonomous systems. 
Section 3 describes the concept of general AI. Section 4 classifies the 
decision-making systems in the structure of intelligence. Section 
5 presents problems with modern ethical standards. The relations 
of morality to the autonomous systems are presented in Section 6. 
Section 7 shows the issue of the electronic personality of AI systems. 
The proposal for the adaptation of the OODA loop to the decision-
making process in the area of morality is presented in Section 8.
Section 9 is a discussion and comparison of the results obtained 
in other works. The outcomes of the research are presented in the 
Conclusions section. 


2 OODA LOOP IN THE DECISION-MAKING 
PROCESS OF AUTONOMOUS SYSTEMS 


The OODA loop is an acronym for four activities: observe, orient, 
decide and act. This cycle connects acquisition and analysis of 
information, decision selection, and action. Although this solution 
was originally designed for military purposes, it is currently used,
among other things, to describe learning processes [2]. This methodology
is suitable for describing the decision-making process performed by a single
actor. 


2.1 OODA loop stages 


The first stage of the OODA loop is observation, i.e. obtaining infor- 
mation from the environment. It is usually carried out by various 
sensors that replace human senses. At this stage, an important issue 
is the vulnerability of the systems to unreliable or intentionally 
crafted information. In the case of autonomous systems, observation
cannot be limited to receiving information only from selected sources,
as this may result in the exclusion of relevant information. This
stage is crucial because conclusions and decisions are based on
these data. Alternatively, the information held by the system can
be supplemented with new data collected from the environment. 
Another important part of this stage is to define and determine the 
parameters that will be used in the next step of the loop [3]. 

The second stage is the orientation, i.e. synthesis and analysis 
of the collected information, in order to build an up-to-date per- 
spective of the situation. Autonomous systems do this with the 
use of weights that are assigned to individual parameters. The 
weights are usually set by the manufacturers. The user can also 
define weights for individual parameters if the producer allows it. 
At this stage, crowdsourcing is also used to distinguish between 
weight values [26]. 

The third step is to choose a decision, which consists of determin- 
ing the actions that should be performed based on the orientation 
in a situation. Usually, there are several possible choices, and the 
probability of success is determined for each of the possibilities. 
The decision algorithm favors the option with the greatest potential
for success and the least damage [3]. Problems arise when making
decisions related to fuzzy attributes, such as giving joy or pain. In
this case, the producers shift the final decision to a human being.

The last stage of the OODA loop is an action. It consists of 
implementing the selected choice of action. This usually boils down 
to direct interaction with the environment. 
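
The four stages described above can be sketched as a minimal loop. This is
an illustrative sketch only, not the implementation of any particular
system: the names Option, observe, orient, decide, act, ooda_step and
propose, the sensor and weight dictionaries, and the scoring rule (success
probability minus expected damage) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Option:
    """A candidate action (hypothetical structure, for illustration)."""
    name: str
    success_probability: float  # estimated chance of success, 0..1
    expected_damage: float      # estimated harm, 0..1


def observe(sensors: Dict[str, Callable[[], float]]) -> Dict[str, float]:
    """Stage 1: read each sensor once."""
    return {name: read() for name, read in sensors.items()}


def orient(observations: Dict[str, float],
           weights: Dict[str, float]) -> Dict[str, float]:
    """Stage 2: weight each observed parameter; unknown ones get weight 0."""
    return {name: value * weights.get(name, 0.0)
            for name, value in observations.items()}


def decide(options: List[Option]) -> Option:
    """Stage 3: assumed rule, prefer high success and low damage."""
    return max(options,
               key=lambda o: o.success_probability - o.expected_damage)


def act(option: Option) -> str:
    """Stage 4: stand-in for direct interaction with the environment."""
    return f"executing {option.name}"


def ooda_step(sensors, weights, propose) -> str:
    """Run one pass; `propose` maps the oriented situation to options."""
    situation = orient(observe(sensors), weights)
    return act(decide(propose(situation)))


def propose(situation: Dict[str, float]) -> List[Option]:
    # Hypothetical rule: a close obstacle makes braking the likelier success.
    close = situation["distance"] < 2.0
    return [Option("brake", 0.9 if close else 0.5, 0.1),
            Option("continue", 0.2 if close else 0.8, 0.3)]


sensors = {"distance": lambda: 2.0, "speed": lambda: 8.0}
weights = {"distance": 0.5, "speed": 0.25}
print(ooda_step(sensors, weights, propose))
```

Keeping each stage as a separate function mirrors the text: the weights
enter only in the orientation stage, and the choice among options is made
only in the decision stage.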


2.2 The problems of using the OODA loop in 
autonomous systems 


The use of the OODA loop in autonomous systems causes problems.
One of them is reduced control over the device and reduced
predictability of its decisions when the use of the loop is
multiplied [15].

The first problematic area is the initial stage of OODA, i.e.
obtaining information. First of all, receiving information from other
systems or external sensors is problematic. In such a situation, it is
difficult to make sure that the obtained data are correct; they can only
be trusted. Data that have already been collected must be kept
up to date. Not only do the data themselves become outdated, but so do
the context and the relations between the data.

In the second stage, the problem is the adequate assignment 
of weights to various data collected for inference. It is possible to 
create general classes of collected data and the weight ranges that 
can be assigned within the classes, but the weight determination 
has to be done dynamically, ad hoc. Predefined weights do not
work well in dynamic situations, and the context may also
cause differences in the weights assigned to individual parameters.

In the third stage, i.e. in the decision-making step, attention
should be paid to the problem of inference from a large number of 
parameters. Some parameters are undefined and appear as new in 
the previous stage of the loop. Various interpretations of the existing 
situations and unclear norms or practices concerning the same area 
of interest also pose a problem. Ambiguity in the case of algorithms 
also causes complex problems in autonomous decision-making 
systems. Another problem at the decision stage is the need for 
efficient decision-making. It takes time to collect data, interpret it, 
assign weights, and analyze it. There are situations in which a decision
must be made extremely quickly and will have significant
consequences. This is a performance challenge for AI machines on
the one hand, and an optimization challenge for their manufacturers 
on the other. 


3 THE IDEA OF GENERAL ARTIFICIAL 
INTELLIGENCE 


The concept of intelligence is nowadays defined as the ability to 
perceive, analyze and adapt to changes in the environment, or 
as the ability to understand, learn and use the knowledge and 
skills in various situations [20]. Psychologists of the 20th century
distinguished many types of intelligence [6], such as social or motor
intelligence. The definition of AI appeared in the 1950s [16], when
the subject was transferred from man to machine. From that
moment on, AI and natural intelligence should be distinguished. 
The main goal of AI engineers is to create systems with human-like 
intelligence, although it is not the only type of intelligence. 

AI, similar to human intelligence, is being developed today in 
two areas [26]. The first of them is narrow AI, which maps the 
features of human intelligence only in predefined ranges: actions, 
choices, or situations. However, in addition to devices that are 
adapted to actions and behavior only under certain conditions, 
there is a need for devices that are close to the second category, 
the so-called general AI. So far, it has not been possible to produce 
general AI by artificial means, mainly because of technical constraints.
Nevertheless, some features of natural intelligence are possible to 
design and implement [11]. 

Both human beings and devices share common elements. The 
key common point is collecting information about the context (en- 
vironment, conditions, situation, etc.), processing it, and making 
decisions and actions based on the information processed. When 
designing devices with AI, engineers have to consider the assump- 
tion that an abstract approach to intelligence is never separated 
from emotions, the environment, and other people [26]. Therefore, 
this issue has to be approached holistically, just as a human being 
perceives the reality and functions in it. 


4 THE SELF-LEARNING AUTONOMOUS 
SYSTEMS 


The way to create general AI is most evident in the generation of 
devices that pretend to be fully autonomous systems. When an- 
alyzing this group of devices, we should start with the broadest 
type, namely interactive systems. These systems actively and
spontaneously interact with their environment. They often enter the
sociological sphere and everyday life of a person. Going further,
such devices as flying or driving vehicles initiate their operation by 
themselves, and are often able to modify the rules of their actions 
based on information received from the environment. Self-learning 
autonomous systems are additionally able to collect and process 
information about the environment and the context of the situa- 
tion, and to modify the models of their activities. Decision-making 
systems, which additionally make important decisions based on 
newly acquired information, are a special case of such systems. The 
hierarchy of systems in the area of intelligence is shown in Figure 1. 


[Figure 1 (diagram): Intelligence branches into natural intelligence
and artificial intelligence; artificial intelligence branches into
narrow and general artificial intelligence. Nested beneath, from
broadest to narrowest: interactive systems, autonomous systems,
self-learning systems, decision-making systems.]


Figure 1: The autonomous decision-making systems in the 
area of intelligence. 


Although there are no completely autonomous decision-making 
systems, there are modern devices that are confronted with situ- 
ations requiring a choice based on a new context. The dynamic 
technological development has the potential to create independent 
systems [14]. This raises many problems related to the responsi- 
bility for the choices and actions taken, as well as the moral evalu- 
ation of actions undertaken. To simplify the procedures, it might 
be recommended to treat AI systems as a product [14]. In addition, 
the primary responsibility for the operation of AI devices is put 
on software developers [24]. In the case of autonomous decision-making
systems, it is not known what conclusions, and therefore what
actions, the machine can ultimately reach. Currently developed
decision-making systems are usually based on machine learning, 
and especially on deep learning technology, which can often lead to 
unexpected behavior [14]. The more freedom in making decisions 
an AI device receives, i.e. the greater the level of machine auton- 
omy, the more hazards it may generate. Such software is highly 
unpredictable, as it is based largely on ad-hoc data, and as a re- 
sult the probability of obtaining completely unexpected or even 
unintended outcomes is increased. In many difficult situations, it 
is recommended to transfer the decision to a person who has the 
capability to take control of the AI system operation at any time, 
especially when the operation of the machine may cause damage 
or threat to human health or life [13]. Some authors are inclined to 
the thesis that man should be the last link of every decision-making 
chain [12, 22]. To evaluate certain behaviors, there is a need for 
a moral judgment to decide whether man should take control of 
the device [29]. The doubts arising from the unclear relationship
between the device and the damage have prompted legal regulators
to take steps to eliminate the problem of "distributed
irresponsibility" [14].


5 THE PROBLEMS OF CONTEMPORARY 
ETHICAL STANDARDS 


Contemporary regulations are trying to keep up with the dynami- 
cally developing technological progress. In the past decade, a lot 
of work has been undertaken to eliminate the dissonance between 
the law and the actual usage of autonomous systems. However, the 
law still fails to guarantee fair regulations in the futuristic context 
of AI [9, 14]. 

Many problems arise in the process of creating new, universally 
applicable standards in the field of autonomous decision-making 
systems. The first problem is delegating decisions. The questions 
concern determining the scope of the decision, responsibility for 
the decisions made, as well as the comprehensibility of the reasons 
for the decision itself. Another problem is injustice and social in- 
equality. In the case of self-learning systems, a frequent case is the 
bias of algorithms that ends up with biased reasoning, which affects 
the decisions made [18]. Personal data protection also causes many 
problems regarding the tracking of user preferences, and their dis- 
closure directly or indirectly [4]. Users are often treated subjectively, 
which violates human dignity. Another example of problems is the 
methodology of the implementation of ethical norms in algorithms: 
what source of ethical norms should be adopted, which norms are 
obligatory for particular geolocations, and who is ultimately re- 
sponsible for the operation of the device in the situation of a breach 
of the adopted ethical norms. 

In commonly known ethical standards [1, 5, 10, 12, 25] it is pos- 
sible to distinguish some common groups of values, mentioned 
in most norms. Some of them are a big challenge for AI device 
manufacturers; the transparency of choices and actions is a good 
example of these challenges. This problem is known as the "Black 
box" [28]. The premises of the reasoning in neural networks al- 
gorithms that AI systems usually rely on are not clear. Therefore, 
ethical standards also raise the issue of the clarification of the 
decision-making process. From the user’s perspective, it should be 
intelligible, and from the producer’s and developer’s perspective, 
it should be accountable. The next value is justice of the choices 
made, but, as mentioned above, there is no rigid framework for this, 
although it is already being implemented in some areas [8]. The
standards approach it differently: they recommend that developers
implement non-maleficence and beneficence.
The standards also remind producers about the privacy of
data processed using AI systems and about data anonymization, if 
possible. An important problem is to determine who is responsible 
for the choices and actions of AI systems, especially when it comes 
to autonomous and adaptive technologies. The problem is how to 
distribute responsibility between the creators, operators, and users 
of AI systems. 

The creation of ethical standards and regulations usually encoun- 
ters various problems in terms of their application by manufacturers. 
The first problem is the selectivity in choosing the rules to implement
(ethics shopping), i.e. some rules are implemented, while the
others are ignored. Another problem is ethics bluewashing, i.e. creating
standards or rules that have no real impact on the behavior of
the device, just to show the use of ethical standards in created AI
systems, in order to improve the brand or product image. Another
example is ethics lobbying, when the created rules are beneficial
for a given producer, region, or nation. A similar problem is ethics
dumping. Since imposing ethical standards is seen as a restriction
on devices, competing manufacturers perform ethics dumping
by asserting that they do not restrict their devices ethically (by
applying ethical standards) as other manufacturers do. The last problem
is ethics shirking, i.e. not applying the ethical norms and principles
because no severe penalties are imposed for such a practice or
because compliance with ethical standards is not enforced.


6 MORALITY IN THE CONTEXT OF 
AUTONOMOUS SYSTEMS 


The philosophy of AI considers whether intellect and consciousness 
can be assigned to an autonomous machine, if the machine works 
in a way similar to human behavior [19]. This assumption is a step 
towards granting morality to such AI systems, but this poses a wider 
problem than their mere awareness. Still, this does not diminish 
the moral nature of decisions and actions taken by autonomous, 
decision-making AI systems. Since the machine with AI can make 
decisions that are subject to moral evaluation, it becomes an actor 
subject to moral evaluation as well. 

The seemingly trivial moral judgments may often have hidden 
complexity. Although AI systems are technologically advanced, 
there are beliefs [22] that they will not be able to make morally 
correct decisions as they will not be capable of ethical reflection. 
In a situation of moral dilemmas, when there is no unambiguous, 
ethically justified solution, and various ethical theories propose 
different solutions, a reasonable choice should be made, justified, and
explained. This imposes flexibility in AI decision-making systems,
as sticking to only one ethical theory may lead to the acceptance
of an undesirable pattern of moral behavior, i.e. to puritanism [22].
Thus, it is impossible to adopt one ethical theory and order its 
implementation by ethical standards in AI systems, because the 
very choice of such a theory is already a moral choice, determining 
the concepts of good and evil. Moreover, there will be no choice 
here. 

According to the current state of the art, AI has neither emotional
nor moral intelligence. Choosing the proper solution is related to
common sense, which is realized in making the right choices
and which characterizes only people [22]. Reasoning is also not
a feature of AI systems. The decisions of AI machines are, at
best, rational. In making the proper decisions in the moral sense, 
three values should be followed: righteousness (compliance), for- 
bearance, and prudence [22]. All these features are attributed only 
to man: forbearance, awareness of the limits of the intelligence, 
and the prudence of discerning what decisions are the most appro- 
priate. Having moral intelligence is related to the ability to make 
judgments based on the intellectual and moral virtues that consti- 
tute the personality of the subject. Additionally, moral intelligence 
represents the ability to be honest, responsible, empathetic, and 
forgiving [22]. General theology connects morality, self-awareness, 
interaction, and even love in a man who is the image of God [26].

A man has the ability to limit the wide area of analysis, using
various relations, contexts, or shortcuts, thus setting the framework 
to the decision process [26]. A similar selection should be made 
by the decision-making system, as otherwise the system will not 
be able to process a huge amount of data, or the decision-making 
process will be stretched over time. Man uses his mind in a highly 
selective way, whereas the machine processes all the information it 
obtains. The selection skills of man are shaped through interactions 
with the environment, and the acquisition of such skills is a long- 
term process and requires a social context. This entails such skills as 
recognizing emotions, compassion, and understanding. Establishing 
and maintaining interpersonal interactions, identifying social roles 
or practices, and managing one’s behavior are the consequences of 
moral intelligence, too. 

The theological approach to AI systems, which assumes that they are
similar to man and thus to the image of God, imposes selected abilities
on the subject, such as the ability to think abstractly, behave morally,
maintain a relationship with the environment (following the formula
of divine relations), transcend oneself, or have a sense of
freedom. The emotional area of AI systems may show inauthenticity,
simulation, or even a lack of empathy. This is problematic
because the behavior of AI systems may shape the general moral 
status in the future, as they will become part of a system that sets 
certain limits of behavior. 

It must be admitted that no information technology is
morally neutral. When making decisions, it implements acts that
are not amoral. Thus, there is a chance to implement at least a moral
base in autonomous decision-making systems, because the basic
moral principles do not depend essentially on external factors, such
as the decisions of society: evil is evil even if everyone does evil,
and good is always good, even if no one does good.


7 TOWARDS THE ELECTRONIC 
PERSONALITY OF AI SYSTEMS 


The development of AI is approaching a limit that is difficult to 
exceed, namely the self-awareness of AI systems. Currently, the 
sense of one’s existence is attributed only to man. Nevertheless, 
research on AI does not exclude the emergence of machines with 
consciousness, capable of non-obvious thinking. 

Nowadays, AI successfully models abstract and conscious human 
reasoning, relying on the accepted principles of logic. These are only 
some selected functions of the human mind that man performs in 
an automated manner, without further reflection on them, in a
sensorimotor way [26]. The producers of autonomous decision-making
systems face challenges such as building systems that empathize,
respect dignity, and desire the good of other individuals. Other
challenges are forbearance, understanding, discovering hidden meaning,
and significant social interactions.

Unlike consciousness or self, personality can be determined us- 
ing technical parameters such as temperament levels. The topic 
of electronic personality (e-person) is raised in the discussions on 
AI systems. Furthermore, emphasis is placed on the deontological 
dimension of a subject that respects the adopted ethical standards 
in decision-making processes. Contemporary regulations place AI 
systems in the area of digital ethics, for which dignity in the deon- 
tological understanding is the center [14]. 


8 ADAPTATION OF THE OODA LOOP TO THE 
DECISION-MAKING PROCESS IN THE AREA 
OF MORALITY 


In the case of making moral autonomous decisions, we deal with the 
concept of moral intelligence. This concept is based on the feeling 
of moral consequences related to the decisions made. It raises many 
ethical problems and confusion when it comes to moral dilemmas. 
Moral intelligence also implies having a basic knowledge of ethics, 
which allows making choices following ethical principles. 

The first step in adapting the OODA loop to ethical decisions is 
the implementation of ethical principles, practices and standards 
as a permanent component. This adaptation takes place in the sec- 
ond stage of the loop, when setting parameters and weights for 
individual information collected in the first stage of the loop. It 
would be a huge risk to let self-learning algorithms independently 
define basic moral concepts, because it may lead to cognitive
distortions. The limitations of algorithms pose problems in defining such
concepts as good or luck. When trying to define these concepts as
primary, we come to the problem of regressus ad infinitum: each
definition relies on further concepts that have to be defined in turn,
and so on. The fuzzy
set theory can be a solution to this problem, as this theory can deal 
with imprecise information, and provide precise output needed for 
the decision algorithms [27]. Creating a predefined moral frame- 
work allows for autonomous and free decisions, albeit with one 
limitation: decisions and actions are aimed at producing the great- 
est possible good or the least evil. Therefore, it is a model of moral 
acceptability [22], used in ethical theories. Moreover, this paper
assumes that the data retrieved in this stage are reliable.
The quality of the data collected from various sensors is a
complex issue analyzed in contemporary research [23], although it
is not the subject of this research. Undoubtedly, this is an issue for
future consideration to improve moral inference.
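
The role of fuzzy sets mentioned above, turning an imprecise attribute
into a precise value usable by a decision algorithm, can be illustrated
with a small sketch. It is not taken from [27]: the triangular membership
functions, the three labels, the representative weights, and the
weighted-average defuzzification are assumptions chosen for brevity.

```python
from typing import Dict


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set (feet a and c, peak b)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets describing an imprecise "harm" estimate in 0..1.
HARM_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}

# Hypothetical crisp weight that represents each label.
LABEL_WEIGHT = {"low": 0.1, "medium": 0.5, "high": 0.9}


def fuzzify(x: float) -> Dict[str, float]:
    """Degree to which x belongs to each label; x is clamped to 0..1."""
    x = min(1.0, max(0.0, x))
    return {label: triangular(x, *abc) for label, abc in HARM_SETS.items()}


def crisp_weight(x: float) -> float:
    """Weighted-average defuzzification: one precise weight for x."""
    degrees = fuzzify(x)
    total = sum(degrees.values())  # never 0 for x clamped to 0..1
    return sum(degrees[k] * LABEL_WEIGHT[k] for k in degrees) / total


print(fuzzify(0.25))
print(round(crisp_weight(0.25), 3))
```

An estimate of 0.25 belongs partly to "low" and partly to "medium", and
the resulting weight lies between the weights of those two labels.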

To adjust the inference stage, the essence of the moral decision
and the ethical context that influences it have to be understood.
Moral consideration comes down to a multifaceted approach to the 
situation and the selection of a set of specific moral principles that 
are the reference. As mentioned, they should be predefined in the 
decision-making system, although many ethical approaches may 
be implemented. The selection of the most suitable among them 
may be made by the AI system in the second stage of the OODA 
loop, recognizing the situational context. However, in this case, it 
is not possible to rely on the principles developed from previous 
loop outcomes. Given the unpredictability of the decision-making
process, they can lead to immoral decisions. It should be
noted that predefining ethical principles, on the one hand, allows
maintaining high morality of conclusions, and on the other hand,
prevents the machine from subjectively verifying or creating its own
criteria for moral decisions.

The second stage of the OODA loop should be extended with 
weight classes, resulting from the significance of data that can be 
assigned with different values, but only in the areas of predefined 
ranges. These ranges will be adapted by the decision-making system 
to a specific situation or context and will result from the adopted 
ethical norms and principles. Predefining the weight classes depends
on the manufacturer, who is responsible for the level of freedom
that will be left to the system in making decisions. Therefore, the
autonomy of the decision-making system lies only in the range
allowed by the manufacturer. Such a reduction should be seen as
a virtue of the system. There are also such good limitations in
nature, e.g. limiting the killing of other creatures (with the general 
assumption of species survival). 
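
The idea of weight classes with predefined ranges can be sketched as
follows. The class names and numeric ranges are hypothetical and stand in
for values that, according to the proposal, the manufacturer would derive
from the adopted ethical norms; the system may propose any weight, but it
is kept inside the range of its class.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class WeightClass:
    """A range of admissible weights predefined by the manufacturer."""
    low: float
    high: float

    def clamp(self, proposed: float) -> float:
        """Keep a proposed weight inside the admissible range."""
        return min(self.high, max(self.low, proposed))


# Hypothetical classes and ranges, for illustration only.
WEIGHT_CLASSES: Dict[str, WeightClass] = {
    "human_life": WeightClass(0.9, 1.0),
    "property": WeightClass(0.2, 0.6),
    "comfort": WeightClass(0.0, 0.3),
}


def assign_weights(proposed: Dict[str, float]) -> Dict[str, float]:
    """Accept situation-specific weights, limited to each class range."""
    weights = {}
    for parameter, value in proposed.items():
        if parameter not in WEIGHT_CLASSES:
            raise KeyError(f"no weight class defined for {parameter!r}")
        weights[parameter] = WEIGHT_CLASSES[parameter].clamp(value)
    return weights


# The system proposes weights for the current context; values outside
# the predefined ranges are pulled back to the nearest bound.
print(assign_weights({"human_life": 0.1, "property": 0.5, "comfort": 0.9}))
```

In this sketch the autonomy of the system is exactly the width of each
range: a narrow range leaves little freedom, a wide one leaves more.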

To sum up, the approach to the OODA loop proposed in this
paper allows the usage of this methodology in the area of moral
inference. This requires extending the loop in three stages. In the first
stage (observation), various ethical standards and norms should be
implemented. In the second stage (orientation), the parameters and
the weight ranges corresponding to them should be defined, e.g. using
fuzzy sets. Finally, the third, decision-making stage should contain
the choice of a specific ethical theory, based on the situational
context.
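
The choice of an ethical theory based on the situational context can be
sketched as a lookup among predefined scoring rules. Everything named here
is hypothetical: the two contexts, the two scoring rules (net good and
least evil), and the 0..1 scales. The rules are fixed in advance rather
than learned from earlier loop outcomes, in line with the proposal, and an
unrecognized context is not decided by the machine.

```python
from typing import Callable, Dict, List, NamedTuple


class Action(NamedTuple):
    name: str
    good: float  # estimated benefit, 0..1 (hypothetical scale)
    evil: float  # estimated harm, 0..1 (hypothetical scale)


def net_good(action: Action) -> float:
    """Score an action by the good it does minus the evil it causes."""
    return action.good - action.evil


def least_evil(action: Action) -> float:
    """Score an action only by how little evil it causes."""
    return -action.evil


# Predefined mapping from situational context to scoring rule.
THEORY_BY_CONTEXT: Dict[str, Callable[[Action], float]] = {
    "routine": net_good,
    "life_threatening": least_evil,
}


def choose_action(context: str, actions: List[Action]) -> Action:
    """Pick the best action under the theory selected for the context."""
    if context not in THEORY_BY_CONTEXT:
        # Assumed fallback: hand the decision over to a person.
        raise LookupError(f"no predefined theory for context {context!r}")
    return max(actions, key=THEORY_BY_CONTEXT[context])


candidates = [
    Action("swerve", good=0.9, evil=0.4),
    Action("brake", good=0.6, evil=0.2),
]
print(choose_action("routine", candidates).name)
print(choose_action("life_threatening", candidates).name)
```

The same candidates lead to different choices in different contexts, which
is the flexibility the text asks for, while the set of admissible theories
stays under the control of the manufacturer.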

The proposed adaptation can be used in many autonomous sys- 
tems that collect a lot of data from the environment and must 
efficiently make decisions. Self-propelled vehicles or autonomous 
robots are good application examples. The proposed solution can 
also be used in automatic systems, such as evaluation systems (among
others, of prisoners), social media post moderation, or systems for
recognizing emotions and moral attitudes.


9 DISCUSSION 


K.A. Chagal-Feferkorn [3] described the use of OODA loops in 
decision algorithms. The author distinguishes between criteria that 
must be met by decision-making systems of various categories, e.g. 
life and death decisions, decisions requiring real-time inference, or 
the dynamic nature of the sources, from which the system obtains 
information. In addition, to assess the decision-making autonomy 
of the AI system, human-based metrics are used in order to assess 
to what extent the device can replace man and what the device’s 
freedom of operation looks like. Such metrics depend on many 
elements at individual stages and therefore should be considered 
in a specific stage of the decision loop. 

The use of any decision-making framework by AI systems brings 
problems. E. Magrani [14] rightly indicates the values that should 
be taken into account in the preparation of such systems: fairness, 
reliability, security, privacy, data protection, inclusiveness, trans- 
parency, and accountability. Nowadays, it is not always possible to 
provide all these attributes at the same time, although this cannot 
be an excuse for abandoning any of them. The provision of the 
indicated values ultimately depends not on the loop itself, but on 
the methods used at its stages in the process of implementing the 
functionality. Therefore, the mentioned attributes are important
guidelines, but they do not affect the methodology of the OODA loop. As
rightly noted by E. Magrani, the designed models should be focused 
on a human being. They should be sensitive to values, which is 
ensured by the OODA loop in the case of taking into account the 
adaptation requirements, described in Section 8. 

From the point of view of ethics, using the OODA loop entails
dealing with heuristics. G. Szulczewski [22] therefore sees a problem
related to the unpredictability of moral reflection in the
decision-making process. Additionally, he notices that in many situations
there is no time to perform analyses. These problems are real, but
they are caused by the decision-making process itself, not by moral
analysis. The advantage of decision-making systems over a man
should be noticed here, as the systems can use a lot of information
that would require decades for a person to learn, or that would
never be learned at all.

Undoubtedly, the use of the OODA loop in decision-making 
processes, including moral choices, will not provide autonomous 
systems with features of the moral intellect of a human being, such 
as moral scruples, psychological discomfort, or compassion. These 
attributes influence a person’s moral decisions, as they are used 
as guidelines to make the correct choice. Currently, no system can 
implement such features, because they are associated with the self- 
awareness of existence. Still, morally correct choices can be made in 
a specific situation, which is denied by G. Szulczewski [22]. Indeed, 
this is still an adaptive decision-making process, albeit leading to a 
morally correct decision and therefore moral action. 

The use of models that allow moral inference, as rightly noted 
by E. Magrani [14], obliges us to consider such systems as the 
so-called moral machines that actively participate in society. This 
entails important challenges for producers of autonomous decision- 
making systems, especially their responsibility for the machines 
that constitute human reality. The use of various moral models, 
including the OODA loop, allows avoiding many immoral behaviors, 
which is always an added value for the environment participating 
in the interaction. 


10 CONCLUSIONS 


The OODA loop is a proven solution used in autonomous systems 
processing Big Data. The application of the adaptive assumptions, 
proposed in Section 8 of this paper, allows the use of this cycle 
also for moral inferences. In some circumstances, the OODA cycle 
works better than humans, as it can use a lot of information that a
person is unable to learn in a lifetime.

The use of any solution does not change the fact that we are still 
dealing with a limited machine, although adapted to autonomous 
functioning in society. Contemporary autonomous decision-making 
systems only pretend to possess unattainable human attributes, 
such as consciousness, but in most cases, they do not need it. It is 
enough for them to be guided by superior, desirable ethical values 
in their interactions with the environment. 

While it is currently impossible to build moral intelligence, there 
are no barriers to scientific development. The use of various solu- 
tions in the decision-making processes of ethical (at least to some 
extent) AI systems is a step towards creating the volitional area of 
machines. The OODA loop is a very good example showing that, on the one
hand, this is possible at all, and on the other hand, it does not require
a lot of work. The way to achieve the goal is to adapt to human
decision-making in the areas of morality.


ABSTRACT


This paper applies data mining of weight measures to discover 
possible long-distance trade routes among Bronze Age civilizations 
from the Mediterranean area to India. As a result, a new northern 
route via the Black Sea is discovered between the Minoan and the 
Indus Valley civilizations. This discovery enhances the growing set 
of evidence for a strong and vibrant connection among Bronze Age 
civilizations.


1 INTRODUCTION


Discovering long-distance trade relations gives a deeper insight 
into the economies of ancient civilizations. For example, lead ingots 
were traded between Sardinia and Israel (Yahalom-Mack et al. [13]), 
and the Minoans on the island of Crete traded vervet monkeys 
and baboons with eastern Africa (Urbani and Dionisios [12]) and
cumin (Cuminum cyminum) with India (Tsafou and Garcia-Granero 
[11]). Together with the exotic goods, their names also spread as 
loanwords [1]. However, exotic goods constituted only a small part 
of the trade among ancient civilizations. A more sophisticated view 
of the intensity of trade relations can be obtained by an analysis of 
the weights that were used at various locations. 

Recently, Ialongo et al. [4] published an analysis of the Bronze 
Age weight system and argued that an essentially common weight 
system spread from Mesopotamia to the west all the way to Ireland 
and to the east all the way to the Indus Valley Civilization. They
gave a mathematical analysis that suggests that as merchants traveled
from one place to another, they took their balance scales and
weights with them and allowed the local merchants to copy these
weights. Therefore, the main mode of weight exchange was successive
copying throughout a huge trade zone that did not
have a central authority over it. That is surprising and contradicts
earlier assumptions that the introduction of a unified weight system
requires a central authority intent on standardizing trade
within the realm of some kingdom or empire by fixing a standard
weight to which every other weight must be adjusted.

Ialongo et al. [4] showed that while a uniform weight system 
could emerge without the intervention of a central authority, the 
successive copying of weights meant that the average unit weight 
gradually shifted from the original Mesopotamian unit as the use 
of the weights spread to the peripheries. 

Instead of taking an overall view of the spread of the Bronze 
Age weight system, in this paper we focus on the Minoan weight 
system and try to answer the question of from where the Minoans 
acquired their weight measures. 

The rest of this paper is organized as follows. Section 2 describes 
the data sources with a full listing of all the known weights that 
were used by the Minoans and others in the Near East and Middle 
East in ancient times. Section 3 presents the data mining results 
with the main discoveries of associations between the weights of 
several locations. Section 4 discusses the results in terms of geo- 
graphical distribution and possible alternate trade routes that may 
have existed in the past. Finally, Section 5 gives some conclusions 
and describes future work. 


2 DATA SOURCES OF WEIGHT MEASURES 


Ialongo et al. [4] provided the exact measurements of 2274 Bronze
Age stone and metal weights. They collected this large data set over
ten years by visiting various museums and taking measurements of 
the weights contained in the collections of those museums. Table 1 
shows that there are 71 Minoan weight measures from Crete and 112 
Minoan weight measures from the Cyclades, which are represented 
by Akrotiri, Ayia Irini, and Philakopi. Hence there are a total of 183 
Minoan weight measures within the large data set. Of these,
four pairs have identical weights, which are highlighted in light
green. We shifted some of the rows left or right to align the identical
weights. Without double counting the identical weights, there are
a total of 179 different Minoan weight measurements. When the
exact location of a Cretan weight measure is unknown,
the site name is simply indicated as ‘Crete’. Similar data is available
for the other Bronze Age sites in Ialongo et al. [4].


Table 1: Weight measures in grams at various Minoan Bronze Age sites

Site            Weights

Akrotiri        11.8, 12.7, 14.5, 16.3, 20.2, 23.2, 28.9, 32, 33, 35.25, 36, 37.8, 39, 39.65, 41.85, 42.5, 48.9, 52.9, 54.5, 56.6, 58, 65, 66.5, 80.3,
                84, 86.1, 88.1, 88.3, 92.2, 95.4, 110, 115.3, 119.6, 169.7, 184.9, 187, 216, 234.5, 236.9, 239.9, 241.9, 252.5, 297.7, 357.8,
                369.2, 483.8, 704.6, 744.3, 1009.4, 1021.2, 1028, 1162, 1408.4, 1506.2, 1619
Ayia Irini      12, 13.2, 13.6, 15, 15.2, 15.4, 15.9, 20.25, 20.35, 28.1, 28.6, 30.35, 31.1, 31.5, 31.6, 34.4, 34.9, 35.8, 38.7, 39.4, 39.7, 40.1,
                42.3, 53.3, 54.8, 55.75, 57.9, 58.1, 58.3, 58.9, 59.95, 60.3, 61.15, 61.25, 63.6, 64, 64.9, 65.55, 67.05, 70.45, 79.9, 83.05, 85.7,
                88.7, 91.9, 97.7, 121, 219.2, 390.6, 506.6, 626.7, 965.2, 1030.1, 1158.8
Crete           3.6, 7.5, 8.4, 43.25, 66.5, 73.62, 94, 113, 1140
Haghia Triada   24.3, 50.7, 237.1, 319, 402.9, 1487.8
Katsambas       9.3, 9.8, 10.2, 10.5, 11.4
Knossos         5.15, 8.45, 8.54, 12.6, 15.57, 16, 19.4, 19.82, 22.05, 35, 42.7, 59.92, 62.26, 96.4, 273.47, 327.02, 1567.47
Mavro Spelio    11.4, 57.4, 74.4, 251.8
Mochlos         19.4, 29.3, 30.4, 32.1, 43.7, 44.5, 92.9, 342.2, 720.3, 828.5, 14581.1
Pachyammos      31.7
Palaikastro     7.8, 14.4, 33.38, 63.1
Philakopi       190, 470, 1530
Praisos         46, 506.9
Tylissos        6.4, 9.5, 23.9, 30.6, 33.7, 40.8, 47.2, 220, 310, 472.4, 473.4, 477.5
Zakros          220, 1421.3


3 DATA ANALYSIS 


According to Ialongo et al. [4], the weight measures were repeat- 
edly copied as merchants spread the weight system wherever they 
traveled. Each copying could introduce some error. Hence if there 
is an exact match between two weights, then it is a strong indica- 
tion that one is a direct copy of the other instead of being just a 
copy of a copy of some degree. In other words, perfect matches are 
indicators of direct trade links where the merchants took goods 
and their weights between the two locations. To investigate direct 
trade links, we selected three broad regions for our study: 


1. The Minoan civilization, which flourished in the Bronze Age 
on Crete and the Cyclades. 

2. The Fertile Crescent, which in our study included 
Mesopotamia, Syria, and southern Anatolia. 

3. The Indus Valley civilization, which includes three major 
towns: Chanhu-Daro, Harappa, and Mohenjo-Daro. 


We identified all perfect matches among the weights in the data- 
base that linked across at least two of these three regions and in- 
volved the Minoan civilization. Figure 1 shows these inter-regional 
matches. We found 30 inter-regional matching weight measure 
pairs, triangles and quadrangles that involved the Minoan civi- 
lization. Out of these 30 inter-regional matching weight groups, 
26 groups, or about 86.7 %, agree with Ialongo et al.’s theory of a
gradual spread from the Fertile Crescent area both west to the
Aegean area and east to the Indus Valley civilization. However,
four groups, or about 13.3 %, do not seem to fit well into that theory.
These groups, which are highlighted in pink in Table 2, are puzzling
for the theory because they show matching weights between
the Indus Valley and the Minoan civilizations without the common
value occurring anywhere in the Fertile Crescent.
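
The matching step described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used for this study: the function names `inter_regional_matches` and `lacks_region` and the nested dictionary layout are hypothetical, and the sample holds only a handful of values quoted in this paper (a few Ayia Irini and Akrotiri weights from Table 1, the 28.6 grams weights at Harappa and Mohenjo-Daro, and the 28.4 grams weight at Ur), not the full database of Ialongo et al. [4].

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Tiny illustrative sample: region -> site -> weights in grams.
# Only values quoted in the paper are used; this is NOT the full data set.
SAMPLE: Dict[str, Dict[str, List[float]]] = {
    "Minoan": {"Ayia Irini": [15, 28.6, 35.8], "Akrotiri": [54.5]},
    "Fertile Crescent": {"Ur": [28.4]},
    "Indus Valley": {"Harappa": [28.6], "Mohenjo-Daro": [28.6]},
}


def inter_regional_matches(
    data: Dict[str, Dict[str, List[float]]], anchor: str = "Minoan"
) -> Dict[float, List[Tuple[str, str]]]:
    """Group sites by exactly equal weight and keep the groups that
    involve the anchor region and at least one other region."""
    by_weight = defaultdict(set)
    for region, sites in data.items():
        for site, weights in sites.items():
            for weight in weights:
                by_weight[weight].add((region, site))

    groups = {}
    for weight, places in by_weight.items():
        regions = {region for region, _ in places}
        if anchor in regions and len(regions) >= 2:
            groups[weight] = sorted(places)
    return groups


def lacks_region(group: List[Tuple[str, str]], region: str = "Fertile Crescent") -> bool:
    """True when no site of the given region shares the matched weight."""
    return all(r != region for r, _ in group)


groups = inter_regional_matches(SAMPLE)
puzzling = [w for w, g in groups.items() if lacks_region(g)]
```

On this sample the only inter-regional group is the 28.6 grams triplet, and it has no Fertile Crescent member, which is the kind of group discussed next. Exact equality is used here because the text speaks of perfect matches; a tolerance-based comparison would be a different design choice.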

Consider for instance the 28.6 grams weight triplet. According to
Ialongo et al.’s theory, a 28.6 grams weight had to exist somewhere
in the Fertile Crescent, and from there merchants traveled to at
least three different destinations, (1) Ayia Irini, (2) Harappa, and (3)
Mohenjo-Daro, where perfect copies of it were made.

If we were to accept Ialongo et al.'s theory, then we must assume 
that all the 28.6 grams weights that were used in the Fertile Crescent 
are now lost. That is unlikely. We also would have to accept that at 
three different times perfect copies were made of the 28.6 grams 
weight that is now lost. That is also unlikely. 

The alternative explanation for the existence of the 28.6 grams 
weight triplet is that the original 28.6 grams weight existed in the 
Indus Valley civilization. Suppose that a Harappan merchant took it 
to Mohenjo-Daro, where a copy was made, and to Ayia Irini, where 
another copy was made. This explanation is more plausible than 
the previous one because it presupposes making two perfect copies 
instead of three perfect copies and presupposes that the weights 
survive in the place of origin instead of being lost. 

Of course, even more explanations are possible. For instance, 
one may suppose that the original Fertile Crescent weight was the 
28.4 grams weight found at Ur, and from there a merchant traveled 
to Mohenjo-Daro where a copy was made with some error that 
resulted in a 28.6 grams weight. Then a perfect copy was made at 
Harappa. Then the Mesopotamian merchant went on another trip
to Ayia Irini, where again the same error of 0.2 grams was made,
which again resulted in a 28.6 grams weight. The advantage of this
explanation is that we no longer must account for a lost weight.
On the other hand, there is only a small probability that a copying
error of the same magnitude will be made at two different locations at
two different times.

Any statistical model of the various scenarios to explain the
data will have to rely on various assumptions about the probability
of a certain weight measure being lost in a region and about the
probability of copying errors of various sizes. Unfortunately, these
assumptions may always remain questionable to those people who
cannot imagine a direct trade route between the Indus Valley and
the Minoan civilization that avoids Mesopotamia. Hence, it seems
important to describe some possible alternative trade routes that
merchants may have taken in the Bronze Age between the Indus
Valley and the Aegean area. This we will do in the next section.


Figure 1: Inter-regional weight measure matches among Fertile Crescent (green), Aegean (blue), including the island of Crete
(dark blue), and the Indus Valley Civilization (pink). The green weights can be assumed to have spread from the Fertile Crescent,
but the pink weights suggest a direct contact between the Aegean and the Indus Valley Civilization.


Figure 2: The Trapezus-Tebriz-Astarabad-Meru-Buhhara-Shortugai trade route. (This map is based on
https://en.wikipedia.org/wiki/File:C%2BB-Trade-Map 1-HitherAsiaTradeRoutes.JPG)


4 DISCUSSION OF THE RESULTS 


In Table 2, the matching weight groups highlighted in pink link the
Indus Valley civilization with the Cycladic subgroup of the Minoan
civilization. In particular, the direct Indus Valley-Minoan trade links
seem to lead to Akrotiri, which was the main commercial center
on the island of Thera, which is now called Santorini, and Ayia Irini,
which was the main commercial center on the island of Keos. This
suggests the following possible alternate route.


4.1 An Alternative Route 


A possible route may have started from Shortugai, which was an 
Indus Valley civilization outpost town in the Himalayas. It is found 
today in Afghanistan. In Figure 2, we added Shortugai to a map 
that shows some ancient trade routes. It can be supposed that these 
routes already existed in the Bronze Age because we know that 
through Shortugai tin and lapis lazuli were transported to the Indus 
Valley civilization.

This alternative route would go from Shortugai west on the
ancient Oxus River, which is now called the Amu Darya River. 
After Buhhara, a merchant could travel southwest to Meru, then to 
Astarabad, then to Tebriz, and finally to Trapezus. The Indus Valley 
and the Minoan traders could have met at Trapezus because the 
Minoans could sail through the Dardanelles and the Bosporus straits 
and sail along the northern coast of Turkey to reach Trapezus. 

This scenario is plausible because the mountain-dwelling Shortugai
traders would need to travel through the mountains and the
island-dwelling Minoans would need to sail on the seas. An advantage
of this direct Indus Valley-Minoan trade would be to circumvent
the Mesopotamian intermediaries, who would likely raise the price
of the goods.


4.2 Further Evidence of a Northern Trade Route

If the Indus Valley traders turned south around Lake Van,
then they could also reach Cape Gelydonia and Ebla directly.
There are many perfect matches between the Indus Valley weights
and the Cape Gelydonia and Ebla weights without a matching
Mesopotamian weight. This again suggests that the Indus Valley
traders had direct contacts with traders from Cape Gelydonia and
from Ebla.


Figure 3: Inter-regional weight measure matches among Fertile Crescent (green), Aegean (blue), including the island of Crete
(dark blue), the Indus Valley Civilization (pink), and Maramures (orange).


Figure 4: A Venn diagram of those weights that are the most important indicators of direct trade. Pink weights indicate a direct
trade between the Indus Valley civilization and either the Aegean or Maramures. The orange weight indicates a direct trade
between the Aegean and Maramures. The green weight indicates a likely source from the Fertile Crescent.

Furthermore, there also may have been direct contacts between 
the Indus Valley civilization and some Bronze Age successors of 
the Old European culture in Southeastern Europe. In 1880, Hampel 
[3] reported a set of weights for the gold treasure found in today’s 
county of Maramures, Romania (formerly Marmaros, Hungary). 
Figure 3 shows the associations between those weights and the 
other sites from Ialongo et al. [4]. 

Figure 4 shows a Venn diagram of those weights that indicate
direct trade among any pair of the three periphery regions of the
Aegean, the Indus Valley Civilization, and Maramures. The 12.6 grams
weight match between the Aegean and Maramures suggests direct
trade between the two regions. Similarly, the 54 grams weight match
between the Indus Valley civilization and Maramures suggests direct
trade between those two regions. The 15, 28.6, 35.8, and 54.5 grams weight
matches between the Aegean and the Indus Valley civilization
imply direct trade between those two regions.

Figure 2 already suggested a trade route between the Black Sea
port of Trapezus and the Indus Valley civilization town of Shortugai.
From Trapezus one can sail on the Black Sea to the Danube Delta,
and from there reach its tributaries, including the Tisza River, which
leads to the Maramures area as shown in Figure 5. This hypothetical
route would be a natural connection between the Indus Valley
civilization and the Maramures area. In addition, it is possible to
sail from the Danube Delta to the Sea of Marmara and from there to
the Aegean Sea. This hypothetical route would be the most suitable
connection between Maramures and the Minoan sites in the Aegean
area. It was probably the Minoans who sailed this sea route
between the Danube Delta and the Aegean.


Figure 5: A hypothetical Maramures-Trapezus-Shortugai trade route. (This map is based on the direction finder of Google
Maps https://www.google.com/maps using the keywords of Maramures and Shortugai.)


4.3 Related Work on the Minoan and the Indus Valley Civilizations


Our study of trade relations adds valuable information to the already 
known data regarding the Minoan and the Indus Valley civilizations. 
Recent advances in archaeogenetics yielded both mitochondrial and 
autosomal DNA data for the Minoans. Analyses of the mitochon- 
drial [6] and the autosomal [10] DNA data consistently show that 
the Minoan society was composed of several groups. One group
likely came from Anatolia, while another group came from the
Danube Basin and the western littoral area of the Black Sea [6]. The
connection with the Anatolian farmers may go back to the earliest 
farmers in Crete because agriculture spread from Anatolia to the 
Aegean islands. 

The connection with the Danube Basin may stem from the early 
Bronze Age. Many new migrants likely arrived at the island of Crete 
at the beginning of the Minoan civilization, which is called the Early 
Minoan period, around 3000 BC. Another wave of migrants arrived 
at the beginning of the Middle Minoan period around 2200 BC 
according to Arthur Evans. The exact chronology is debated by the 
archaeologists. However, they agree that writing was introduced 
to Crete during the Middle Minoan period. 

Minoan writing had two forms: Cretan Hieroglyphs and the Linear
A script. As early as 1991, Marija Gimbutas pointed out some
similarities between the Linear A script and the Danubian script
signs. Her observation also suggests some population movement
from the Danube Basin to Crete, paralleling the archaeogenetic
data. Hence, the earliest scribal class of the Middle Minoan period
likely consisted of the new migrants from the Danube Basin, and
the underlying language of the Linear A script could be related to
the Pre-Indo-European language of the Old European civilization.
The translation of twenty-eight mostly religious Linear A inscrip- 
tions suggests that the scribal language was a Uralic language [5]. 
The Uralic-speaking peoples have been assumed to have had
a homeland somewhere near the Ural Mountains. However, it is
possible that this language family goes back to the Mesolithic or
Paleolithic period because there are no cognate agriculture-related
words between the Finno-Permic and the Ugric branches of languages.
In that early period, the Uralic homeland was likely in the Danube
Basin rather than anywhere farther to the north.

Surprisingly, there are many cognate Pre-Greek and Ugric words 
[8]. These Pre-Greek cognate words likely were borrowed from the 
Minoan language by Greek. In addition, a graph-based algorithmic 
analysis of Minoan inscriptions was able to show that the Minoan 
language had front-back vowel harmony [9]. Front-back vowel 
harmony was likely already present in the Proto-Uralic language
and is widespread within the Uralic language family. Front-back
vowel harmony is not a characteristic of Indo-European languages. 

The scribal language may have been different from the common 
Minoan language because the Minoan civilization may have been 
multilingual during the Middle Minoan period. Homer (Odyssey, 
book 19, lines 172-177) described the island of Crete as a multilin- 
gual and multiethnic place around 800 BC. While Homer mentions 
Achaeans and Dorians, two well-known Greek groups, he also 
mentions other groups that could be Pre-Greek: the Eteocretans, 
or ‘true Cretans’, the Kydones, and the Pelasgians. Given that the 
Mycenaean civilization followed the Minoan civilization and Crete 
remained continuously under Greek dominance until Homer, it is 
hard to explain the presence of these apparently Pre-Greek groups 
unless they were already in Crete during the Minoan period. 

A detailed art motif analysis [7] also identified three sets of 
Minoan art motifs. The first set contains art motifs that spread from 
the Near East via the spread of agriculture. These motifs spread both 
eastward and westward together with agriculture and can be found 
in both the Indus Valley and Crete during the Early Minoan period. 
The second set of motifs apparently originated in the Danube Basin
because they first appear there in the Neolithic or early Bronze
Age. This second set of motifs first appears on Crete during the
Middle Minoan period. Finally, the third set contains art motifs that 
were likely brought in by the Mycenaeans because they appear on 
Crete during the Late Minoan period, when the island was already 
occupied by the Mycenaeans. 

The Indus Valley script has a close connection with the Sumerian 
pictograms but only a distant relation with the Linear A and Linear 
B scripts [2]. That is likely because syllabic writing developed only 
in the Bronze Age. 


5 CONCLUSIONS AND FUTURE WORK 


We hope these techniques will enable the discovery of other possible
trade routes among ancient civilizations. In general, the method
could also be applied to discover other relations that rely on copying
some other measure, such as length or volume. Our study of trade
relations adds valuable information to the already known similarities.
Each type of similarity shown serves as a piece of a grand
mosaic that depicts strong and vibrant connections among Bronze
Age cultures that were previously viewed as isolates.


ABSTRACT


Making decisions quickly and efficiently is essential in all areas of
life; when a decision involves a sizeable number of voters and
options, it is sometimes appropriate to choose a voting system
that represents the voting population well, without excluding
the preferences of minorities.

In this paper we present a voting system that aims to enhance
Majority Judgment through unsupervised machine
learning techniques, in particular clustering; in addition, a
criterion for obtaining a multiwinner result is added.
After describing how it works, we present a case study to test its
applicability, which extends to multiple fields of interest and is not
limited to purely political occasions.


1 INTRODUCTION


In this section we deal with the general behaviour of some voting rules,
such as the majority rule and the premise-based rule, trying to highlight
their advantages and limits. We are then interested in finding a model
that guarantees the inclusion of minorities in a multiwinner
decision-making process.

A more ‘inclusive’ voting rule leads us to implement a clustered version
of the chosen model (Majority Judgement): clusters are created
by taking into account the similarity between the expressed preferences;
for each cluster, the Majority Judgement rule is applied to return
a ranking over the set of candidates. We now explore the differences
between voting rules. Each voting rule has its own
limitations and advantages: one could choose a voting
rule in order to avoid tactical voting, at the cost of giving up the
representativeness of the judgements. We consider three agents who express
their judgements ("Yes" or "No") on four statements, A, B,
A ∧ B and A ↔ B, and compare the outcomes of the majority rule and the
premise-based rule. The latter takes majority decisions on A and B
and then infers the conclusions on the other two propositions.

As shown in Table 1, the results differ depending on the rule
used.

We now focus on the case of Agent 2: he is represented in just one of
the propositions (A), and in the remaining cases his judgement
does not agree with the outcome. So, Agent 2 could think about
manipulating the final result by pretending to disagree with A.
As a consequence, the premise-based rule yields, as the final
outcome of the three agents’ vote, a "No" for both A ∧ B and A ↔ B,
as originally expressed by Agent 2.


                     A     B     A ∧ B   A ↔ B
Agent 1              Yes   Yes   Yes     Yes
Agent 2              Yes   No    No      No
Agent 3              No    Yes   No      No
Premise-based rule   Yes   Yes   Yes     Yes
Majority             Yes   Yes   No      No

Table 1: Three-agent voting case

In this way, strategic voting by Agent 2 is not prevented. This is
the major drawback of using the premise-based rule.

On the other hand, a paradoxical aspect arises with the majority
rule: the outcomes of the last two propositions are not consistent
with the "Yes" value assigned to both A and B.

This is known as the discursive dilemma, the inconsistency problem in
judgement aggregation based on majority rule [11].
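
The comparison in Table 1 can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function names `majority_rule` and `premise_based_rule` and the encoding of each agent as a pair of Booleans for (A, B) are hypothetical choices made here, and the conclusions A ∧ B and A ↔ B are computed from each agent's own judgements on A and B.

```python
from typing import Dict, List, Tuple

Judgement = Tuple[bool, bool]  # one agent's judgement on (A, B); True means "Yes"


def majority(votes: List[bool]) -> bool:
    """Return True when strictly more than half of the votes are 'Yes'."""
    return 2 * sum(votes) > len(votes)


def majority_rule(profile: List[Judgement]) -> Dict[str, bool]:
    """Take a separate majority vote on every proposition, conclusions included."""
    return {
        "A": majority([x for x, _ in profile]),
        "B": majority([y for _, y in profile]),
        "A and B": majority([x and y for x, y in profile]),
        "A iff B": majority([x == y for x, y in profile]),
    }


def premise_based_rule(profile: List[Judgement]) -> Dict[str, bool]:
    """Vote only on the premises A and B, then infer the two conclusions."""
    a = majority([x for x, _ in profile])
    b = majority([y for _, y in profile])
    return {"A": a, "B": b, "A and B": a and b, "A iff B": a == b}


# Agents 1, 2 and 3 as in Table 1.
honest = [(True, True), (True, False), (False, True)]
# Agent 2 pretends to disagree with A.
manipulated = [(True, True), (False, False), (False, True)]

by_majority = majority_rule(honest)
by_premises = premise_based_rule(honest)
after_manipulation = premise_based_rule(manipulated)
```

With the honest profile, the majority rule accepts A and B but rejects both conclusions, while the premise-based rule accepts all four propositions. With the manipulated profile, the premise-based rule rejects both conclusions, which is the outcome Agent 2 originally wanted.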

The results can be interpreted as follows: under the majority rule, it
is always in each agent’s interest to give his true preference. For this
reason we consider the majority rule a transparent asset (with no tactical
voting) in the decision process, while trying to deal with its
intrinsic issues due to judgement aggregation [13].

Our aim is not to solve the above-mentioned dilemma, but to use the majority
rule as a baseline for a more refined model (Majority Judgement),
with the use of clusters for a more inclusive rule.


2 STRATEGIES OF DECISION MAKING 


2.1 Collective decision process and Majority 
Judgement 


During business meetings it is sometimes difficult to bring together 
the ideas of all participants regarding important decisions: this can 
lead to slowness and non-productivity. In many contexts, a majority
vote is used to resolve this type of situation. This is a good method,
but it is a rough approximation of the collective desire, as it excludes
part of those who represent the minority. In order to limit this effect,
an algorithm has been developed starting from the traditional
Majority Judgement (MJ), enhanced with a clustering technique
which, before calculating the resulting winner, divides the voting
population into groups with similar preferences.

Whereas choice theory is concerned with individuals making
choices based on their preferences, social choice theory is concerned
with how to translate the preferences of individuals into the
preferences of a group; a typical case in which this process arises
is electoral voting [6]. During voting,
electors in a democracy choose one candidate from a list of many
candidates, while in a jury decision the individual judges evaluate
the competitors in a competition, ranking them.

Arrow’s impossibility theorem in social choice theory states that 
when voters have three or more distinct alternatives (options), no 
ranked voting electoral system can convert the ranked preferences 
of individuals into a community-wide (complete and transitive) 
ranking while also meeting the specified set of criteria: unrestricted 
domain, non-dictatorship, Pareto efficiency, and independence of 
irrelevant alternatives [1]. In [18], the Condorcet and Borda methods
and their limits, Arrow’s impossibility theorem and MJ are illustrated, and
the results of general elections are used to show that voting systems
can run into Arrow’s paradox. A known example is the 2000
US presidential election, where the presence of a minor candidate,
Ralph Nader, who had no chance of winning, made Bush the winner
in Gore’s place. It is reasonable to suppose that the presence of Nader
split Gore’s vote, given their political positions. Moreover,
Nader supporters preferred Gore to Bush. However, the American
voting system permits a single vote in a single round where all
the candidates are available options, and the one with the most votes
is the winner. In this way, voters could not fully express their
preferences: the winner turned out to be just the candidate more
resilient to Arrow’s paradox, and not the most preferred one.

MJ is a voting technique proposed in 2007 by two mathematicians,
Michel Balinski and Rida Laraki, aiming to overcome the paradoxes
and inconsistencies of traditional voting methods. In [3],
Balinski and Laraki present MJ, a model that overcomes traditional
models, which are prone to suffer from incoherence, impossibility
and incompatibility; in [5] the authors reject the assumption of
traditional methods, arguing that electors do not really make a personal ranking
of the candidates, and highlight this false belief as the reason behind
the inadequacy of voting models other than MJ.

In [4] they also present the case of the French presidential election
of 2002 as another example of Arrow’s paradox: the winner depends
on the presence or absence of candidates, including those who have
absolutely no chance of winning. Balinski and Laraki’s point is
that only the presence of a common language leads to a coherent
collective decision, and consequently a greater expressiveness in
the voting system minimizes the paradoxical effects. MJ makes this
possible: it asks electors/judges to express a judgment on all the
candidates/competitors, using a known common language. Each
grade is associated with a numeric score, and the candidate with
the highest median score is the winner. In case of a tie, a tiebreaker
is used which considers how "broad" that median grade is. Even
though it is not possible to completely avoid strategic voting, MJ
strongly resists manipulation. This is one of the features we
wanted to consider when modelling an inclusive and transparent
voting rule.


2.2 Social theory’s requirements 


May (1952) [15] introduced four requirements that a majority
voting rule must satisfy [7]:

• Universal domain: the domain of admissible inputs of the
  aggregation rule consists of all logically possible profiles of
  votes $\langle v_1, v_2, \ldots, v_n \rangle$, where each $v_i \in [-1, 1]$ (to cope
  with any level of ‘pluralism’ in its inputs);

• Anonymity: applying any kind of permutation to the individual
  preferences does not affect the outcome (to treat all voters
  equally), i.e., if $\langle w_1, w_2, \ldots, w_n \rangle$ is a permutation of
  $\langle v_1, v_2, \ldots, v_n \rangle$, then

  $$f(v_1, v_2, \ldots, v_n) = f(w_1, w_2, \ldots, w_n) \qquad (1)$$

• Neutrality: each alternative has the same weight and, for
  any admissible profile $\langle v_1, v_2, \ldots, v_n \rangle$, if the votes for the
  two alternatives are reversed, the social decision is reversed
  too (to treat all alternatives equally), i.e.,

  $$f(-v_1, -v_2, \ldots, -v_n) = -f(v_1, v_2, \ldots, v_n) \qquad (2)$$

• Positive responsiveness: for any admissible profile
  $\langle v_1, v_2, \ldots, v_n \rangle$, if some voters change their votes in favour of
  one alternative (say the first) and all other votes remain the
  same, the social decision does not change in the opposite direction;
  if the social decision was a tie prior to the change, the
  tie is broken in the direction of the change, i.e., if $w_i > v_i$ for
  some $i$ and $w_j = v_j$ for all other $j$ and $f(v_1, v_2, \ldots, v_n) = 0$
  or $1$, then $f(w_1, w_2, \ldots, w_n) = 1$.
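
The last three requirements can be checked by brute force for a small number of voters. The sketch below is illustrative only and rests on assumptions made here: votes are restricted to the three values -1, 0 and 1 (against, abstain, in favour), the decision is the sign of the sum of the votes, and the names `simple_majority`, `dictatorship` and `satisfies_may_conditions` are hypothetical. Universal domain is covered implicitly, because the check enumerates every profile over these vote values.

```python
from itertools import permutations, product
from typing import Callable, Tuple

VOTES = (-1, 0, 1)  # assumed discrete votes: against, abstain, in favour
Profile = Tuple[int, ...]


def simple_majority(v: Profile) -> int:
    """Sign of the vote total: 1, -1, or 0 for a tie."""
    total = sum(v)
    return (total > 0) - (total < 0)


def dictatorship(v: Profile) -> int:
    """Counterexample rule: the first voter alone decides."""
    return v[0]


def satisfies_may_conditions(f: Callable[[Profile], int], n: int) -> bool:
    """Check anonymity, neutrality and positive responsiveness on all profiles."""
    for v in product(VOTES, repeat=n):
        # Anonymity, equation (1): any permutation gives the same decision.
        if any(f(w) != f(v) for w in permutations(v)):
            return False
        # Neutrality, equation (2): reversing all votes reverses the decision.
        if f(tuple(-x for x in v)) != -f(v):
            return False
        # Positive responsiveness: raising one vote from a tie or a win gives 1.
        if f(v) in (0, 1):
            for i in range(n):
                for higher in VOTES:
                    if higher > v[i]:
                        w = v[:i] + (higher,) + v[i + 1:]
                        if f(w) != 1:
                            return False
    return True


majority_ok = satisfies_may_conditions(simple_majority, 3)
dictator_ok = satisfies_may_conditions(dictatorship, 3)
```

Under these assumptions the simple majority rule passes all three checks, while the dictatorship fails because permuting the voters can change the decision.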

A multi-winner election (V, C, F, k) is defined by a set of voters V
expressing preferences over a set of candidates C, and
a voting rule F that returns a subset of k winning candidates. A voting
rule can act on different types of ordered preferences, even though
the most common ones have a pre-fixed linear order on the alternatives.
In most cases, these are chosen a priori.

Formally, we denote the set of judgements made by the i-th voter
as the preference profile $P_i$. Each profile contains information about
the grades given to the candidates by the voters. The voting rule F associates with
every profile P a non-empty subset of winning candidates.

In multi-winner elections more precise traits are required, compared
to the ones stated in May’s theory [9]. Indeed:

• Representation: for each subset of voters

  $$V_i \subseteq V \quad (\text{with } |V_i| > \lceil n/k \rceil) \qquad (3)$$

  at least one successful candidate is elected from that partition;

• Proportionality: for each subset of voters

  $$V_i \subseteq V \quad (\text{with } |V_i| > \lceil n/k \rceil) \qquad (4)$$

  the number of elected candidates is proportional to the subset’s
  size.


An implicit assumption so far has been that preferences are
ordinal: preference orderings contain no information about
each individual’s strength of preference or about how to compare
individuals’ preferences with one another. In voting contexts, this
assumption may be acceptable, but in welfare-evaluation contexts, when a
social planner seeks to rank different social alternatives in an
order of social welfare, the usage of further information may be
justified.


2.3 Single-winner Majority judgement 


In order to describe the majority judgement, we need a table that
records the grades given to every candidate in C, stored as tuples [2].
Supposing there are six possible grades, we may use the words:
excellent, very good, good, discrete, bad, very bad.

So each candidate is described by a bounded set of votes.

The winner is found by recursively comparing the median grades of the
candidates: first, the grades of each candidate are ordered in columns
from the highest to the lowest according to the order relation, then
the middle column (the lower middle one if the number of grades is
even) is inspected and the candidate whose row holds the highest grade
there is selected. If there is a tie, the algorithm keeps discarding
grades equal in value to the shared median, until one of the tied
candidates is found to have the highest median. Our aim is to
generalize this single-winner strategy to a multi-winner one, using a
clustering approach in judgement aggregation. So, we first discuss our
choices for clusters and then we describe the algorithm.
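The median comparison and tie-break described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the numeric ranking of the six grade words, the removal of one copy of the shared median per round, and the alphabetical fallback for a perfect tie are all assumptions.

```python
# Grade words from the text, ordered from worst to best (ordering assumed).
GRADES = ["very bad", "bad", "discrete", "good", "very good", "excellent"]
RANK = {grade: i for i, grade in enumerate(GRADES)}


def lower_median(ranks):
    """Middle grade of a descending list; the lower one if the count is even."""
    ordered = sorted(ranks, reverse=True)
    return ordered[len(ordered) // 2]


def majority_judgement(ballots):
    """ballots: dict mapping candidate -> list of grade words, one per voter."""
    remaining = {c: sorted((RANK[g] for g in gs), reverse=True)
                 for c, gs in ballots.items()}
    tied = sorted(remaining)  # alphabetical order is the (assumed) last-resort tie-break
    while len(tied) > 1 and remaining[tied[0]]:
        medians = {c: lower_median(remaining[c]) for c in tied}
        best = max(medians.values())
        tied = [c for c in tied if medians[c] == best]
        for c in tied:
            remaining[c].remove(best)  # discard one grade equal to the shared median
    return tied[0]
```

With two candidates graded (good, good, bad) and (excellent, good, very bad), both medians are "good"; after one "good" is discarded from each, the first candidate keeps the higher median and wins.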


3 CLUSTERS 


3.1 Categories of clusters 


Different types of clustering share the ability to group data with
some common features. Some of the most important types are:

1. Connectivity models: similarity or difference is derived from the
distance between data points. Two approaches are possible: bottom-up,
where each data point starts as its own cluster and pairs of clusters
are then merged; and top-down, where all observations start in one
cluster, which is then split. The limitation of this model is that an
already created cluster cannot be changed;

2. Distribution models: once the clusters are created, the probability
that a point belongs to a particular distribution is computed.
Limitations appear when no precise constraints are given, as this
model tends to overfit the data;

3. Density models: clusters are created in areas of higher density
of data points, while the remaining points can be grouped into an
arbitrarily shaped, distribution-free area; these features make the
model likely to be less sensitive to noise than other types of clusters;


In our case, clusters need to satisfy the requirements of social
theory, which determine a fairly fixed structure, but with no
assumption about the data distribution. For these reasons, we focus on
a different class of clustering algorithms, the centroid models.


3.2 K-Medoids 


For our goal, namely selecting winners from a group of candidates,
K-medoids clustering is used. Our choice is due to the fact that
averaging methods like K-means clustering could result in a solution
which doesn't belong to the candidate list. In our case, the medoid is
a data point (unlike the centroid) which has the least total distance
to the other members of its cluster [10].

Another advantage of this choice is that the mean of the data points
is highly affected by extreme points; so, in the K-Means algorithm, the
centroid may be shifted to a wrong position and hence produce an
incorrect clustering if the data has outliers. On the contrary, the
medoid used by the K-Medoids algorithm is the most central element
of the cluster, such that its total distance from the other points is
minimal. Thus, compared to the K-Means algorithm, K-Medoids is more
robust to outliers and noise [8].

The K-medoids implementation we use is in the Python sklearn library
[17]. This library supports partitioning around medoids (PAM) [12],
proposed by Kaufman and Rousseeuw (1990). The workflow of PAM
is described below [16].

The PAM procedure consists of two phases, BUILD and SWAP:


• In the BUILD phase, primary clustering is performed, during
  which k objects are successively selected as medoids.

• The SWAP phase is an iterative process in which the algorithm
  attempts to improve some of the medoids. At each iteration, a
  pair (medoid and non-medoid) is selected such that replacing the
  medoid with the non-medoid object gives the best value of the
  objective function (the sum of the distances from each object to
  the nearest medoid). As long as the objective function can be
  improved, the procedure is repeated.


Suppose that n objects having p variables each should be grouped
into k (k < n) clusters, where k is known. Let us define the j-th
variable of object i as $X_{ij}$ (i = 1,...,n; j = 1,...,p). The
Euclidean distance is used as a dissimilarity measure; between object i
and object j it is defined by:


    $d_{ij} = \sqrt{\sum_{a=1}^{p} (X_{ia} - X_{ja})^2}$    (5)


where i and j range from 1 to n. The medoids are selected in this
way:

• calculate the Euclidean distance $d_{ij}$ between every pair of
  objects;

• calculate $v_j = \sum_{i=1}^{n} \frac{d_{ij}}{\sum_{l=1}^{n} d_{il}}$
  for each object j;

• sort all $v_j$ for j = 1,...,n in ascending order and select the
  k objects with the smallest values as initial medoids;

• assign each object to the nearest medoid to obtain the initial
  cluster result;

• calculate the sum of distances from all objects to their
  medoids;

• update the current medoid in each cluster by replacing it with
  the new medoid, selected as the object that minimizes the total
  distance to the other objects in its cluster;

• assign each object to the nearest medoid and obtain the
  cluster result;

• calculate the sum of distances from all objects to their
  medoids; if the sum is equal to the previous one, then stop
  the algorithm; otherwise, go back to the update step.
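The medoid selection steps listed above can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the library code used by the authors: the function name `simple_k_medoids`, the `max_iter` guard and the tolerance-based stopping test are assumptions, and the data is assumed to contain at least two distinct points so that no row of the distance matrix sums to zero.

```python
import math


def simple_k_medoids(points, k, max_iter=100):
    """Cluster points (tuples of numbers) around k medoids; returns (medoid indices, labels)."""
    n = len(points)
    # Pairwise Euclidean distances d[i][j].
    d = [[math.dist(p, q) for q in points] for p in points]
    row_sum = [sum(row) for row in d]
    # v[j] = sum_i d[i][j] / sum_l d[i][l]; small values mark central objects.
    v = [sum(d[i][j] / row_sum[i] for i in range(n)) for j in range(n)]
    medoids = sorted(range(n), key=lambda j: v[j])[:k]

    def assign(meds):
        # Label each object with its nearest medoid and total the distances.
        labels = [min(range(k), key=lambda c: d[i][meds[c]]) for i in range(n)]
        cost = sum(d[i][meds[labels[i]]] for i in range(n))
        return labels, cost

    labels, cost = assign(medoids)
    for _ in range(max_iter):
        # Update step: the new medoid minimizes the total distance within its cluster.
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                medoids[c] = min(members, key=lambda m: sum(d[m][i] for i in members))
        labels, new_cost = assign(medoids)
        if math.isclose(new_cost, cost):  # sum unchanged: stop
            break
        cost = new_cost
    return medoids, labels
```

On two well-separated groups of three points each, the sketch returns one medoid per group and assigns every point to the medoid of its own group.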


In our case, prior knowledge about the number of winners is required,
and the identified clusters are restricted to a minimum size equal to
the number of voters divided by the number of candidates (q).


3.3 Clustered Majority Judgement 


For each cluster majority judgement is applied, and a final ranking
of candidates is returned [14]. Given k, the number of candidates to
be elected, the algorithm seeks the optimal number of clusters to
create. The number of clusters ranges from 1 to k and has to satisfy an
additional requirement: if a tie occurs and k' vacant seats are left,
the algorithm is repeated k' times until the tie is broken. If the tie
is not broken, the number of clusters is changed.

We present the relevant steps in pseudocode: 


(1) set the number of winners as the maximum number of clusters;

(2) clusters are created, decreasing the maximum number of clusters
until the optimal number is reached. This number is bounded by
the cluster size, which satisfies the following proportion:
number of voters : number of winners = number of voters
in one cluster : one winner;

(3) the function winners calculates the median for every cluster
that is created;

(4) check that the winners from the clusters are distinct; if they
are not (condition = "ko" in the pseudocode), the algorithm goes
back to step 2 with a maximum number of clusters equal to the
number of vacant seats, and the successive steps are executed until
all seats have been filled.
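The four steps above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. The names `mj_winner`, `clustered_mj`, `cluster_fn` and `min_size` are hypothetical; `cluster_fn(ballots, n_clusters)` stands for any clustering routine (for instance K-medoids) that returns one label per ballot; grades are assumed to be numbers where higher is better. Excluding already elected candidates when the loop is repeated is also an assumption, made so that the sketch always terminates.

```python
from collections import defaultdict


def mj_winner(ballots, candidates):
    """Single-winner Majority Judgement; each ballot maps candidate -> numeric grade."""
    grades = {c: sorted((b[c] for b in ballots), reverse=True) for c in candidates}
    tied = list(candidates)
    while len(tied) > 1 and grades[tied[0]]:
        medians = {c: grades[c][len(grades[c]) // 2] for c in tied}  # lower median
        best = max(medians.values())
        tied = [c for c in tied if medians[c] == best]
        for c in tied:
            grades[c].remove(best)  # drop one copy of the shared median
    return tied[0]


def clustered_mj(ballots, candidates, n_winners, cluster_fn, min_size=1):
    winners = []
    while len(winners) < n_winners:
        seats = n_winners - len(winners)  # step 1 / step 4: vacant seats
        still_open = [c for c in candidates if c not in winners]
        # Step 2: decrease the number of clusters until all are large enough.
        for n_clusters in range(seats, 0, -1):
            labels = cluster_fn(ballots, n_clusters)
            groups = defaultdict(list)
            for ballot, label in zip(ballots, labels):
                groups[label].append(ballot)
            if all(len(g) >= min_size for g in groups.values()):
                break
        # Step 3: one winner per cluster; step 4: keep distinct winners only.
        for label in sorted(groups):
            winner = mj_winner(groups[label], still_open)
            if winner not in winners:
                winners.append(winner)
    return winners
```

In a polarized electorate, the per-cluster winners differ from the top two of a single Majority Judgement ranking, since the smaller cluster elects its own preferred candidate.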


3.4 Case study: Using clustered majority 
judgement to maximize agreement 


A local cultural association organizes a themed film club annually
during the summer. It is possible both to attend a single event
and to buy a subscription. To maximize the sale of season tickets,
a portion of the associated users responded to a survey concerning
the numerous possible topics of the film club. The survey was
structured so that voters assign each topic a value from 1 to 7,
corresponding to Excellent, Very Good, Good, Acceptable, Poor,
To Reject, No Opinion, in order to allow a meaningful comparison of
Majority Judgement (MJ) and Clustered Majority Judgement (CMJ).
Compared to MJ, CMJ introduces a new input parameter: the number of
winners, which is fixed at 2. 37 voters took part in this experiment
and the algorithm formed two clusters, exactly as many as the number
of winners.


Algorithm 1

Require: k > 0
Ensure: n_winners = (n_1, ..., n_k), k > 1
  k ← number_winners
  max_cluster ← k
  condition ← "ko"
  while condition = "ko" do
      cluster_list ← cluster(vote_list)
      for all cluster in cluster_list do
          winners_per_cluster ← compute_winners(cluster)
          all_winners ← list_of_all_winners(winners_per_cluster)
      end for
      list_winner_distinct ← list_of_all_distinct_winners(all_winners)
      option_remaining ← number_winners − len(list_winner_distinct)
      if option_remaining = 0 then
          condition ← "ok"
      else
          k ← option_remaining
          condition ← "ko"
      end if
  end while


Cluster   | Cluster size | Winner
Cluster 1 | 21           | Satire
Cluster 2 | 16           | Science fiction


Table 2: CMJ results


Ranking MJ | Candidate
1          | Satire
2          | History


Table 3: Top 2 of single-winner Majority Judgement applied
to voters


We can compare the CMJ results with the single-winner MJ ranking by
comparing Table 3 and Table 2. We notice, both for Majority
Judgement and Clustered Majority Judgement, the tendency to avoid
the favourite topic, focusing on the moderate ones. Furthermore,
in the CMJ case, the expressed judgements are quite polarizing and the
two clusters formed seem to be in opposition to each other, because
the preferred topic of one tends to be judged negatively by the
other.

In the case of MJ, the solution is Satire and History. Considering only
the top two, and given that the themes tend to be similar, it is
likely that a large part of the subscriptions would be purchased
exclusively by people grouped within a single cluster, the largest
one. CMJ instead, in addition to the Satire topic, also considers the
predominant preference of the second cluster. For this reason, this
strategy is the most likely to maximize subscription purchases.


CONCLUSIONS 


In section 1, we explored different voting rules and their limitations.
In section 2, a more refined model of majority rule, Majority
Judgement, was presented as the best model to avoid the incoherence
produced by traditional voting methods.

Then we presented our generalization of Majority Judgement as a
multi-winner strategy, thanks to the use of clusters. After that, a
case study was reported, with particular attention to the comparison
between MJ and CMJ results.

CMJ, as shown, represents the optimal compromise in the case of
polarized groups (clusters). It appears to be the best model for
maximizing agreement, as shown in the case study, since it represents
minorities well by taking their preferences into account more
carefully than a simple Majority Judgement model.

In spite of the non-deterministic nature of K-Medoids, Clustered
Majority Judgement is intended for disputes involving large
populations. We are confident that the ability of clustering to take
all the different perspectives into account can be shown in these
situations. As this implementation depends only on a few fixed
parameters, namely the grades, their number and the number of winners
to select, the algorithm can find a place in any type of
decision-making process.


ABSTRACT


The Industry 4.0 concept refers to new production patterns that
include new technologies, manufacturing elements, and workforce
organizations. It creates highly efficient production systems that
change production processes, reduce production costs and improve
product quality. Quality 4.0 is an evolution of Industry 4.0 and a
modification of traditional quality control charts. In this paper,
our motivation is to improve manufacturing processes by monitoring
product quality and raising the percentage of correctly manufactured
products, thereby achieving efficiency. A four-layer decision-making
architecture is proposed, in which different models and techniques are
applied and compared on a real industrial case study: 1) data
exploration layer, 2) feature engineering layer, 3) modeling layer, in
which three categories of time series forecasting algorithms are
tested: a statistical model (ARIMA), machine learning models (Random
forest and XGBOOST) and deep learning models (Stacked LSTM and a
Transformer-based model), and finally 4) interpretation layer. The
transformer-based model scored the best. From the interpretation of
the classification model, we deduced the recommended values for
monitoring the product's quality in order to approach zero defects.


1 INTRODUCTION


Industry 4.0 makes full use of emerging technologies and the rapid
development of machines and tools to improve the level of industry.
As a result, the manufactured product will be of better quality and
production systems will be more efficient and easier to maintain.
According to [1], a survey conducted in 2019, 32% of the companies
surveyed in Indonesia experienced a 31 to 50% improvement in
operational productivity and performance when adopting Industry
4.0. Quality 4.0 is an evolution of Industry 4.0 and a modification of
traditional quality control charts: it leverages traditional quality
control techniques together with the latest technology to deliver new
levels of excellence at the functional and operational levels [2].
Organizations can monitor processes and extract data from real-time
sensors. Manufacturers applying Quality 4.0 technology have achieved
remarkable efficiency in quality management, thereby expanding market
share, promoting innovation, and improving their ability to face
challenges and enhance brand recognition [2].

In this context we propose a four-layer decision-making architecture
in which different AI models and techniques are applied and compared.
This proposal aims at forecasting the properties of real-world
manufactured products in order to enhance their quality. The forecasted
values are classified and interpreted in order to give recommended
values for monitoring the manufactured product's quality.
Hyperparameter optimisation consists of choosing the best set of
hyperparameters for the learning algorithm, using techniques such as
grid search and random search. According to [26], random search is the
best method of parameter search for small dimensions. For our study, we
tested the following common hyperparameters: Batch_size, epochs,
optimizer and dropout. For each model we add other specific parameters.
For this task, we used Weights & Biases, a machine learning platform,
to track our experiments and versions and to evaluate our models'
performance.

The structure of the paper is as follows. Section 2 reviews related
work on AI in Industry 4.0 along with studies on time series
forecasting in an industrial environment. In section 3 we introduce our
proposal. Section 4 is dedicated to experimentation and results.
Finally, section 5 concludes the paper and outlines future work.


2 RELATED WORKS 


Zhang et al. [8] proposed in 2017 an approach to monitor the
performance degradation of dynamic systems in the context of
manufacturing. This approach is based on LSTM to characterise the
degradation behaviour of the system and then predict the remaining
useful life (RUL). The long-term dependency properties embedded in the
LSTM framework are intended to capture the interrelationships in the
time series of data measured by the monitored system, allowing
for more accurate predictions of future behaviour. The experiments
consist in comparing different models. The authors obtained the
following results: SVR rmse = 20.96, MLP rmse = 20.84, LSTM
rmse = 18.07, Bidirectional-LSTM rmse = 15.42. Bidirectional LSTM
networks perform better.

Futterer et al. [9] examined in 2017 the application prospects of
several supervised learning methods for time series classification in
BACS (Building Automation and Control Systems). They trained thirteen
types of classifiers including complex tree, average tree, simple tree,
linear support vector machine, KNN subspace, augmented RUS tree, good
KNN, raw KNN and random forests. Bagged trees achieved the highest
average classification accuracy (56.76%), with a maximum accuracy of
76.54%. However, the maximum accuracy achieved by random forests was
even higher, reaching 78.95%.

An extensive part of this literature uses traditional machine learning
algorithms [9] [5] [7] such as random forest and linear regression. In
recent years, a significant number of research papers have revolved
around deep learning. Authors [6] [10] usually work with recurrent
neural networks (RNN, LSTM) and convolutional neural networks (CNN).
The concept of the Transformer took shape in 2017. The Transformer is
presented in [24], which describes its architecture and performance
on several translation datasets. To the best of our knowledge, its
use remains exclusive to the field of natural language processing.
This motivated us to test and compare the performance of
transformers against other statistical, machine learning and deep
learning models in forecasting time series product data.


3 PROPOSED APPROACH 


The proposed architecture, depicted in figure 1, aims at creating
an intelligent model for product quality prediction for Industry 4.0.
Our chosen KPI is quality improvement, measured as the percentage
of correctly manufactured products. It is a four-layer architecture:
a first data exploration layer, where we prepare our data to be
suitable for the model; a second layer, representing the feature
engineering step, where we propose a feature selection and data
balancing approach; a third layer that consists of a comparison of
different categories of techniques/algorithms for time series
forecasting: statistics-based, machine learning-based and deep
learning-based; and finally a fourth layer dedicated to model
interpretation, which helps adjust the production values.


[Figure 1 diagram: Dataset x passes through Data exploration, Feature
engineering (feature selection + data balancing, or data balancing +
feature selection), Modeling and Interpretation (explainable AI, XAI),
ending in a final report. Legend: x, original dataset; x1, dataset
after EDA; x2, dataset after feature engineering; x3, forecasted data +
classification results.]


Figure 1: Proposed architecture


In the following subsections the different layers of the
architecture are detailed.


3.1 Data exploration


Data exploration refers to the first step of data analysis, in which
statistical techniques and visualisation are used to describe the
characteristics of a dataset, such as size, quantity and distribution,
in order to better understand its nature. In our work, we followed
these steps. First, we proceeded with a raw data visualization. This
showed us the distribution of the columns and their correlation.
Second, we verified whether the dataset contains missing, negative or
null data. Third, we used quantile outlier removal to eliminate
undesirable noise. We also counted the occurrences in the output class.
Furthermore, since our dataset is a time series, the exploratory
analysis expands to discover other properties of the data. A time
series dataset is stationary when its statistical properties
(expectation, variance, auto-correlation) are not time-varying. Having
a stationary dataset means that it is free of trends and seasonality.
In order to verify the stationarity of our time series, we implemented
the Dickey-Fuller test [13].
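The quantile outlier removal step can be sketched as follows. The paper does not state which quantiles were used, so the 1st and 99th percentile cutoffs, the function name `remove_quantile_outliers` and the column name in the usage note are assumptions made only for illustration.

```python
from statistics import quantiles


def remove_quantile_outliers(rows, column, lower_pct=1, upper_pct=99):
    """Keep rows whose value in `column` lies between two percentiles (inclusive)."""
    values = [row[column] for row in rows]
    # n=100 yields 99 cut points: cuts[0] is the 1st percentile, cuts[98] the 99th.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [row for row in rows if low <= row[column] <= high]
```

For example, given rows with a hypothetical `"pressure"` column, `remove_quantile_outliers(rows, "pressure")` drops the rows in the extreme tails of that column and keeps the rest unchanged.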


3.2 Feature Engineering 


In order to choose the most appropriate approach for our study 
case, we carried out a comparative analysis between the two fol- 
lowing approaches: applying feature selection followed by data 
balancing or applying data balancing followed by feature selection. 
The approach with the best classification results is chosen. For this 
purpose, we implemented a KNN classifier. 


3.2.1 Feature selection. The feature selection step consists of
reducing the number of input variables when developing a predictive
model. It is a basic technique for restricting the variables used to
those that are most effective and efficient for a particular machine
learning system. This practice allows the algorithm to fit and learn
more quickly and, more importantly, reduces its complexity, making it
easier to interpret. According to [14], there are three families of
feature selection techniques: filter methods [15], wrapper methods [16]
and embedded methods [17]. In our approach, we experimented with all
three techniques and compared their results to derive the best
approach. We used variance threshold as the filter method, XGBOOST
with recursive feature elimination (RFE) as the wrapper method, and
lasso as the embedded method.


3.2.2 Data balancing. Among the techniques used in data bal- 
ancing are: Oversampling which creates artificial instances of mi- 
nority classes and undersampling which is used to eliminate the 
instances corresponding to the majority class. To balance the data, 
we opted for a hybrid approach which consists of applying oversam- 
pling to balance the data, followed by undersampling to remove any 
unwanted noise. The oversampling is performed by creating syn- 
thetic minority class samples to balance the dataset with SMOTE 
[18]. For the undersampling, we compared two techniques. The 
first one is tomekLinks (TL) [20] and the second technique is Edited 
Nearest Neighbours (ENN) [19]. 
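The oversampling idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours. This is a simplified illustration, not the imblearn implementation; the function name `smote_oversample`, its parameters and the fixed seed are assumptions, and the undersampling step (tomekLinks or ENN) is not shown.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (x itself excluded).
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already occupied by the minority class.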


3.3 Modeling 


Time series forecasting is about making forecasts based on
time-stamped historical data. It involves building models through
historical analysis and using them to make observations and future
policy decisions. We have chosen to compare three categories of
algorithms and choose the most appropriate for our case: 1) statistical
model: ARIMA; 2) machine learning models: Random forest and XGBOOST;
3) deep learning models: Stacked LSTM and Transformer-based model.


3.3.1 Statistical model: ARIMA. The ARIMA model [21] is a 
statistical method used in stationary time series analysis and fore- 
casting. ARIMA is an abbreviation for Auto Regressive Integrated 
Moving Average. It is a forecasting algorithm based on the idea that 
the information in the past values of the time series can alone be 
used to predict the future values. 


3.3.2. Machine learning models. Machine learning techniques 
can be used for time series prediction. It is necessary to convert the 
time series prediction problem into a supervised learning problem. 
To create the new data form, the previous time steps are considered 
as input variables and the next time step is considered as output 
variable [22]. For our work, we chose to apply the random forest 
and XGBOOST algorithms for the prediction. 
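The conversion of a time series into a supervised learning problem can be sketched as a sliding window: the previous time steps become the input variables and the next time step becomes the output. The function name `make_supervised` and the parameter `n_lags` are assumptions used for illustration.

```python
def make_supervised(series, n_lags):
    """Turn a sequence into (X, y): each X row holds n_lags past values, y the next value."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(list(series[t - n_lags:t]))  # previous time steps as inputs
        y.append(series[t])                   # next time step as output
    return X, y
```

For the series 1, 2, 3, 4, 5 with two lags, the inputs are (1, 2), (2, 3), (3, 4) and the outputs are 3, 4, 5; any regressor such as random forest or XGBOOST can then be trained on these pairs.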


3.3.3 Deep learning models. Stacked LSTM: LSTM networks are a special
type of RNN capable of learning long-term dependencies. Stacking LSTM
layers deepens the model, as the upper LSTM layer provides its sequence
output, instead of a single value output, to the lower LSTM layer. We
propose the following model: LSTM(n=100)(x3), dense(n=50),
LSTM(n=50)(x2) and a final dense(n=1). We used early stopping with a
patience of 25. For hyperparameter tuning, RandomSearch was configured
to find the combination that returns the minimal MAPE. Hyperparameters:
Learning_rate = 0.001, Batch_size = 128, Epochs = 61.

Transformer-based model: The Transformer [24] is a network
architecture based on a "self-attention" mechanism. It returns the
data we need by focusing only on the significant features of a single
sequence. We propose a model inspired by the Transformer. The
model consists of Transformer encoder blocks (x8) that use the
MultiHeadAttention layer (x8) as a self-attention mechanism applied
to the input data. The Transformer encoder block generates a tensor of
shape batch_shape + (num_steps, features). This tensor is processed
through a neural network (multilayer perceptron) and lastly
through a dense layer with a relu activation function to produce
the final output. We used early stopping with a patience of 25.
RandomSearch was configured to find the combination that returns the
minimal MAPE. Hyperparameters: batch_size = 64, dropout = 0.3,
epochs = 300, head_size = 128, hidden_dim = 100, learning rate = 0.005.
For the multilayer perceptron: mlp_dropout = 0.2, mlp_units = 64.
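To make the self-attention mechanism concrete, the following sketch computes single-head scaled dot-product self-attention over a short sequence in pure Python. It is a deliberately reduced illustration and not the model described above: queries, keys and values are taken to be the raw inputs (no learned projections, one head, no dropout), and the function names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """sequence: list of time steps, each a list of features; returns the attended sequence."""
    d = len(sequence[0])
    output = []
    for query in sequence:
        # Similarity of this time step with every time step, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in sequence]
        weights = softmax(scores)
        # Each output step is a weighted average of all time steps.
        output.append([sum(w * step[j] for w, step in zip(weights, sequence))
                       for j in range(d)])
    return output
```

Every output time step is a weighted average of all input time steps, which is how the mechanism lets one position draw on distant positions of the same sequence.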


3.4 Interpretation 


Machine learning models remain largely black boxes. However,
understanding the reasoning behind the predictions is very important,
especially for sensitive decisions in an industrial setting. AI
interpretability shows what is going on in these systems and helps
identify potential problems and model errors. In our work, we used the
LIME [25] interpreter. It is based on finding an independent
description for each instance by creating random samples in the
surrounding area at regular intervals and weighting them according to
their distance from the point of origin. The local description
identifies which dimension of the input is most responsible for the
output of the neural network. The algorithm provides a linear
explanatory model that can be plotted for visualisation.


4 REAL CASE STUDY: EXPERIMENTATION 
AND RESULTS 


The experiment is based on a real case. ADDIXO has provided us
with quality-related data from plastic product manufacturing. In this
work, a product quality prediction module is to be integrated into
the ADDIXO Smart Factory solution. We used the libraries Keras
and scikit-learn with a TensorFlow backend, alongside pandas
for tabular data processing. All visuals are produced with matplotlib
and seaborn.


4.1 Data exploration 


The dataset, illustrated in figure 2, contains 16 columns and 94528
rows. The columns (1-15) represent process variables such as injection
time, pressure and volume. The output column represents the quality
evaluation of the manufactured product, where class 1 = OK and
class 0 = Not OK. We proceeded with the following analysis. First, the
data visualization, described in figure 3, showed normally distributed
variables, as the majority of data points are relatively similar. This
helped us identify the presence of outliers. We used the quantile
method to remove all unwanted noise. Second, the experimental dataset
contains no missing, null or negative values. Third, a count of the
output class showed an imbalance. Finally, according to the
Dickey-Fuller test results, the dataset is stationary at the 95% level
of confidence.


[Figure 2, dataset excerpt; column headers: date cpt.cye protocoles nb.total pieces cpt cyc.machine temps.cycle temps dosage temps injection pression com volume.com matelas val pointe integral]


Figure 3: Normal distribution 


4.2 Feature engineering 


4.2.1 Feature selection. In order to choose the optimal number
of features for the experimental dataset, we compared the 3 methods
presented in 3.2.1. New datasets are generated with the given number
of selected features. A KNN classifier is used to compare the new
datasets along with the original one. For the classification task we
evaluated our models with the accuracy score and the roc_auc score. We
observe in Table 1 that the embedded method scored the highest accuracy
and roc_auc score.


Table 1: Selected features with three methods


Method           | Selected features | Accuracy | ROC_AUC score
Original dataset | 16                | 0.9985   | 0.9827
Filter           | 8                 | 0.9984   | 0.9636
Wrapper          | 15                | 0.9983   | 0.9504
Embedded         | 12                | 0.9994   | 0.9955


4.2.2 Data balancing. A count of the target column showed an
imbalance in our dataset. The minority class (319 instances) is almost
133 times smaller than the majority class (42426 instances). We used
the imblearn.combine classes SMOTEENN and SMOTETomek. As
their names show, SMOTEENN applies ENN undersampling after SMOTE
oversampling, and SMOTETomek applies tomekLinks undersampling after
SMOTE oversampling. To compare the balancing results, we used a KNN
classifier with the following parameters: metric='manhattan', k=3 and
weights='distance'.

The results are summarized in Table 2. SMOTE + tomekLinks
achieved the highest roc_auc score (0.92) whereas SMOTE + ENN
achieved the highest accuracy (0.95). Based on [27], the roc_auc
score is a better measure than accuracy, so we proceed with the
dataset generated by SMOTE + tomekLinks. Data balancing results:
Class 0: 26102 instances. Class 1: 55565 instances.
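The KNN configuration quoted above (Manhattan metric, k = 3, distance weighting) can be illustrated with a small pure-Python classifier. This is a sketch of the technique rather than the scikit-learn implementation used in the experiments; the function names are assumptions, and an exact match with a training point is treated as decisive because its inverse-distance weight would be infinite.

```python
from collections import defaultdict


def manhattan(a, b):
    """Manhattan (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted k-nearest-neighbours vote using the Manhattan metric."""
    nearest = sorted(((manhattan(x, query), label)
                      for x, label in zip(train_X, train_y)),
                     key=lambda pair: pair[0])[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        if dist == 0:
            return label  # exact match: treated as decisive
        votes[label] += 1.0 / dist  # closer neighbours weigh more
    return max(votes, key=votes.get)
```

With distance weighting, one very close neighbour can outvote two distant neighbours of another class, which a plain majority vote would not allow.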

At this stage, having identified the best techniques for feature
selection (embedded method) and for data balancing (SMOTE +
tomekLinks), we need to choose the best order: (1) feature selection
then data balancing or (2) data balancing then feature selection.


Table 2: Data balancing results


Method             | Accuracy | ROC_AUC score
SMOTE + ENN        | 0.9566   | 0.8928
SMOTE + tomekLinks | 0.9132   | 0.9293


Table 3: (1) and (2) comparison results


Approach | Selected features | Accuracy | ROC_AUC score
(1)      | 10                | 0.9994   | 0.9998
(2)      | 42                | 0.9956   | 0.8622


Table 3 shows that approach (1) gives the best results, as it
returns 10 features out of 16 with accuracy and roc_auc scores very
close to 1. We conclude that, for our case study, the best approach is
feature selection followed by data balancing.


4.3 Modeling


Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and
Mean Absolute Percentage Error (MAPE) are calculated to evaluate
the models' forecasting results. Analysing Table 4, we can conclude
that the transformer-based model achieved the best scores on two of
the three metrics (MAPE and MAE). This is because of its ability to
capture long-range dependencies and interactions. In addition, since
its design allows parallel training on the data, the transformer-based
model is more efficient than LSTM, especially in terms of time. In our
case, it is 11.32 times faster than the stacked LSTM. Random forest is
a strong contender, as it scored good values overall in the minimal
time of 2.89 seconds.


Table 4: Forecasting results


Model             | RMSE  | MAPE  | MAE   | Time (s)
ARIMA             | 2.864 | 0.067 | 2.043 | 2245.74
XGBOOST           | 1.689 | 0.025 | 0.791 | 5.335
RandomForest      | 1.615 | 0.023 | 0.724 | 2.891
Stacked LSTM      | 1.464 | 0.027 | 0.838 | 1143.38
Transformer-based | 1.548 | 0.023 | 0.719 | 101
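For reference, the three error measures reported in Table 4 can be computed as in the sketch below. The function names are assumptions, and MAPE is returned as a fraction rather than a percentage, which is assumed here from the magnitude of the reported values.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error as a fraction; true values must be non-zero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason a model can rank first on MAE and MAPE without ranking first on RMSE.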


[Figure 4 plot; legend: History, True Future, Model Prediction; x axis: Time-Step]


Figure 4: Transformer-based model's predictions


For this reason we chose the transformer-based model to generate
a new dataset of predicted values. Figure 4 represents our best
model's prediction, where the x axis represents the time steps and
the y axis represents the values of the predicted column. Next we
classified this new data in order to check the quality of the product.
We used a KNN classifier with the following parameters:
metric='manhattan', k=3 and weights='distance'.
Classification results: Dataset size: 12242, Class 0: 3558, Class 1:
8684.


4.4 Interpretation 


LIME returns a group of weights explaining how the model classified
the random sample [25]. When the weight is positive, the
designated variable favors the classification, and vice versa.
Understanding how the model classified the sample helps us monitor
the variables. Table 5 presents the recommended values to obtain
class 1, which represents acceptable product quality.


Table 5: Recommended values 


Variables            Recommended values
Cycle time           26.90 < x < 26.94
Dosing time          4.01 < x < 4.03
Injection time       x < 3.48
Switching pressure   1347.00 < x < 1353.00
Switching volume     x < 16.50
Mattress             16.32 < x < 16.35
Peak value           1363.00 < x < 1369.00
Integral value       x > 142.00
Total pressure       2710.00 < x < 2721.00
Total volume         32.82 < x < 32.85


To validate the recommended values, two sets of data samples of the
same size were fed into the architecture. The values of the first
set were randomly collected, while the values of the second set were
filtered to lie within the recommended ranges. The share of badly
manufactured products (class 0) decreased from 91.88% in the first
set to 57.55% in the second set.


5 CONCLUSIONS 


In this paper, we proposed an approach based on different AI models
for product quality monitoring in an industrial context. Our chosen
KPI is the percentage of correctly manufactured products. The
data preprocessing phase revealed a strong imbalance between the
output classes. This led us to experiment with two approaches for
the feature engineering phase. The results show that feature
selection followed by data balancing returns the best scores in
terms of accuracy and roc_auc. Next, we proceeded with time series
forecasting. The originality of our work lies in using a
transformer-based model for a time series prediction problem in
Industry 4.0. The evaluation of all the phases using the appropriate
methods and metrics has shown good results in terms of accuracy,
MAPE, RMSE and overall execution time, which supports the
effectiveness of our proposal. The transformer-based model scored
the best (rmse=1.305, mape=0.23, mae=0.548), alongside random forest
(rmse=1.615, mape=0.23, mae=0.724), which appeared to be a serious
contender to deep learning techniques for forecasting. The LIME
interpreter was used to provide the recommended values. Our work is
integrated as a module within ADDIXO's smart factory system in
order to make early decisions about the product's quality. In
future work, we intend to experiment with more industrial datasets
with different variables in order to monitor other quality aspects.
We also intend to use the Temporal Fusion Transformer architecture,
as it has shown significant performance improvements over existing
benchmarks.
ABSTRACT


In an increasingly connected world, where information spreads easily
through multiple channels and platforms, digital watermarking has
been broadly investigated for authorship attribution and
intellectual property protection of digital content. However, text
content poses many challenges due to its low capacity to embed a
watermark. In this paper, we propose a new structural watermarking
method for small pieces of text that can hide up to 15 bits of
watermark per character by manipulating the underlying font
grayscale values. The proposed method ensures length preservation
and is robust to copy and paste. Moreover, the method is able to
embed a password-based watermark while returning visually
indistinguishable watermarked text.
• Human-centered computing → Social content sharing; Collaborative
content creation; • Information systems → Data provenance.
1 INTRODUCTION


In the last decades, we have witnessed a rapid spread of cloud
platforms that meet many kinds of user needs and have increased
users' sharing of images, videos and text documents. The wide
availability of millions of pieces of digital content has unlocked
research and business activities in heterogeneous domains that use
data-hungry methodologies, such as data mining [5] and information
retrieval [6]. On the flip side, the increased circulation of data
and information has exacerbated issues related to data privacy and
provenance, also given the intellectual property rights that usually
cover digital content.

A common approach to digital content protection is the use of
watermarking techniques [17]. Watermarking a piece of media means
embedding information into it with the explicit intent of preserving
the copyright and eventually tracking the origin. Among all digital
content, watermarking a piece of text rather than, for example, an
image or video raises many more challenges. In particular, text has
low embedding bandwidth, meaning that there is much less room to
embed the payload than in an image, where every pixel can hide many
bits of the watermark. Moreover, text allows a restricted number of
alternative syntactic and semantic permutations if readability and
the original meaning are to be preserved [15]. Text watermarking
approaches can be classified into: zero-watermarking techniques, in
which some features of the text are stored on a third-party
authority server; image-based techniques, in which the text is
transformed into an image and the watermark is embedded using an
image watermarking method (for this reason, they cannot be
considered pure text watermarking methods); syntactic and semantic
techniques, which exploit language-dependent features and grammar
rules to embed the watermark; and structural techniques, which
exploit structural and language-independent characteristics to
embed the watermark.

In this paper, we present a new pure text watermarking method
designed for small pieces of text, which nevertheless also works
adequately with long texts. The proposed structural method is able
to embed a password-based watermark while preserving the length of
the original text and returning visually indistinguishable
watermarked text. The method consists of three phases: generation,
embedding and resolution. The watermark is generated by applying
a hash function that combines the text and the user's password.
The embedding phase exploits the underlying font grayscale values,
making it possible to hide up to 15 bits of watermark per character.
This represents our main contribution and improves on the payload
capabilities of state-of-the-art structural methods. In practice,
the proposed method slightly changes the black colour of every
single character in the text to a shade of grey that is
indistinguishable to the user, and devotes the bits freed up by this
change to embedding the watermark bits. To determine the threshold
of grey to be used, we conducted a dedicated survey, which shows
that the watermarked text is visually indistinguishable to the
majority of people. The resolution phase extracts the watermark
from the text, enabling further actions such as text integrity
verification and provenance tracking of the textual data, since the
extracted watermark is inextricably traceable back to the author
thanks to their password. The proposed method has several
significant features. It ensures length preservation, meaning that
it adds no overhead to the original document. The grayscale
threshold ensures that the resulting watermarked text is visually
indistinguishable from the original. Finally, it significantly
increases the payload capability, which is usually a weakness of
text watermarking methods. Moreover, the proposed method can be
applied programmatically in document editing software, such as
Google Documents, Microsoft Word and LibreOffice Writer.

The rest of the paper is structured as follows. In Section 2, we
discuss previous work related to text watermarking methods. In
Section 3, we present our text watermarking method based on font
grayscale values, whereas the survey we conducted is described in
Section 4. Results and limits are discussed in Section 5. Some
concluding remarks are made in Section 6.


2 RELATED WORKS 


In this section, we describe previous work in text watermark- 
ing, outlining drawbacks and limitations and excluding the zero- 
watermarking approach since it does not include a real watermark 
embedding phase. 

Image-based text watermarking transforms the text into an image and
embeds the watermark by modulating the pixels' luminance [4] or the
image's histogram [11], or by altering the inter-word spaces [10] or
the characters' strokes and serifs [1]. Image-based methods reduce
the text watermarking problem to the more widely researched scenario
of image watermarking, which makes them unnatural and impractical in
several contexts.

In syntactic and semantic text watermarking, Natural Language
Processing techniques exploit the syntactic and semantic structure
of the text to embed the watermark. In particular, these approaches
apply clefting/passivization [2] or morpho-syntactic transformations
[12], or exploit term similarity [20] or nouns and verbs [18], to
embed the watermark according to the sequence of bits. These methods
make extensive modifications to the original text, producing a
visibly different document and altering the author's content.
Moreover, they require long texts to embed the whole watermark.

Structural text watermarking is the most recent approach; it
exploits the underlying structure of the text to embed the
watermark. In particular, given the increased diffusion of the
Unicode standard in information systems, data hiding through Unicode
transformations has recently received more interest, exploiting
different whitespace encodings [14], invisible symbols [13] and
homoglyph characters [16]. Unlike the syntactic and semantic
methods, the structural approach preserves the text content.
However, only a small set of homoglyphs can be used effectively, and
the text font affects the visual indistinguishability. For this
reason, we propose a grayscale-based structural method that ensures
visual indistinguishability and length preservation of the original
text.

To the best of our knowledge, this is the first study that uses the
font grayscale hue as a structural characteristic for text
watermarking without transforming the text into an image, as done in
[4]. In the literature, papers regarding grayscale have mainly
focused on grayscale recognition in images by means of computer
vision [3]. Whenever grayscale values are used for watermarking
purposes, they are used to watermark images, as in [8], or in
image-based approaches to text watermarking, as in [7], where texts
and images are watermarked together.
3 OUR METHOD


In this section, we present our structural Grayscale Text
Watermarking (GTW) method, which works by embedding watermark
bits into the shades of grey of the characters in a text document.
Grayscale is a palette of colours that starts with pure black and
ends with pure white, usually represented as a series of hexadecimal
codes (e.g., #171717 is a shade of grey closely resembling pure
black, #000000). After the process is complete, each character
has a different shade of grey and, as such, stands for a different
part of the original bit sequence of the watermark.

The main problem to be solved is that not all shades of grey are
indistinguishable when compared with pure black text. To solve this
issue, GTW uses a two-step process to maximise the number of bits
embedded per character while keeping the outcome invisible to the
majority of people. The first step is to set a threshold on the
shades of grey we use to embed a sequence of bits in a character.
This reduces the number of bits we can embed in a single character
but improves indistinguishability compared to using the whole
palette of grey or the whole colour spectrum. We provide more
details about the threshold and users' perceptions in Section 4.

The second step is the separation of the hexadecimal components
of the character's colour (i.e., red, green, blue). If we used the
entire colour spectrum with a maximum threshold of #070707, several
colours with numerically smaller values would be visually further
away from black and more visible to the user (e.g., the numerical
value of #00ffff, which represents the colour cyan, is smaller than
that of #070707). The solution is to apply a separate threshold of
#07 to each colour component and then reassemble the components
into the full colour used for the character.

As shown in Figure 1, the shade of each component encodes a portion
of the sequence of bits of the watermark and helps to form a grey
shade that respects the defined threshold. In particular, the
grayscale value is chosen by converting a part of the watermark bits
into a hexadecimal number no greater than the colour component's
threshold. Considering the previous #070707 threshold, the maximum
number of bits that can be embedded in a single colour component is
three, since the binary value 111 is equal to 07 in hexadecimal.
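The per-component threshold can be illustrated with a short Python sketch. The helper names max_bits and bits_to_colour are hypothetical and only serve to show how a component threshold bounds the number of embeddable bits and how three bit chunks are reassembled into one hexadecimal colour.

```python
def max_bits(component_threshold: int) -> int:
    """Largest n such that every n-bit value (0 .. 2**n - 1) respects the threshold."""
    return (component_threshold + 1).bit_length() - 1


def bits_to_colour(red_bits: str, green_bits: str, blue_bits: str) -> str:
    """Turn three bit chunks into a #rrggbb colour, one chunk per component."""
    return "#" + "".join(
        f"{int(bits, 2):02x}" for bits in (red_bits, green_bits, blue_bits)
    )


print(max_bits(0x07))                      # bits per component for #070707
print(max_bits(0x27))                      # bits per component for #272727
print(bits_to_colour("111", "010", "001"))
# Why a single numeric threshold is not enough: a bright colour can be
# numerically smaller than a dark grey.
print(0x00FFFF < 0x070707)
```

Treating each component separately avoids the situation shown in the last line, where a clearly visible colour would pass a single numeric threshold.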

The whole watermarking process consists of three phases. Firstly,
in the watermark generation phase, a hash function combines the
original text and the user's password to generate the watermark bit
sequence. The length of the watermark depends on the hash function
used. Then, the watermark is embedded in the original text
characters by modulating the grey value according to the bit
sequence of the watermark, as shown in Figure 1. For


Figure 1: Embedding of a 63-bit watermark in the first
seven characters of the string.


Grayscale Text Watermarking 


IDEAS’22, August 22-24, 2022, Budapest, Hungary


Algorithm 1 EMBEDDING

// The SHA256 hashing algorithm returns a 256-bit watermark.
function EMBEDDING(orgText, userPassword)
    watermark ← SHA256(orgText, userPassword)
    threshold ← #272727
    for each char ∈ orgText do
        redBits ← toGreyShade(pop(watermark, 3))
        greenBits ← toGreyShade(pop(watermark, 3))
        blueBits ← toGreyShade(pop(watermark, 3))
        colour ← redBits + greenBits + blueBits
        wtmText ← wtmText + colouring(char, colour)
    return wtmText


Algorithm 2 VALIDATION


instance, using #070707 as a threshold value, that is #07 for each
colour component (i.e., red, green, blue), we can encode 3 bits of
the watermark in each component, embedding 9 bits in each character.
Moreover, by embedding the watermark using grayscale manipulation,
no overhead data is added to the original text, thus ensuring
length preservation. The third phase is watermark validation,
whose goal is to verify whether a user who claims ownership of the
document is the actual author. The proposed method belongs to
the class of blind text watermarking methods [15], which means that
the watermark can be extracted without the original text. Because
the original text is supposed to use black characters, while
our watermarked text also contains characters in shades of grey,
the watermark can be extracted from the watermarked text using a
reverse colouring function. In practice, the extracted watermark is
compared with the watermark generated by combining the original
text, which is obtained by cleaning the watermarked text, and the
password of the claiming user. Another consequence of the proposed
approach is that the watermark is invisible, not readable and
detectable [15]: it is sufficient to examine the colour of each
character to determine whether it is pure black or a shade of grey.

The pseudo-code of the embedding and validation algorithms
is presented in Algorithms 1 and 2, respectively. We used the
SHA256 (Secure Hash Algorithm) hashing function, which returns a
256-bit watermark. If the user has less stringent security needs,
this function can be replaced with algorithms that generate shorter
sequences. Since the #272727 threshold allows embedding 5 bits
in each colour component, that is 15 bits per character, the SHA256
digest can be embedded in 18 characters. Both the embedding and
validation algorithms work at the character level, embedding and
recovering the bit sequence of the watermark in the colour
components of the characters. The cost of the two algorithms is
linear in the number of characters in the text in the worst-case
scenario. It is worth noting that the proposed method is both
language independent and not bound to any platform. In other words,
the method can be integrated into any text editing software that
permits programmatic manipulation of a single character's colour.
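The embedding and validation phases can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the paper: the text and password are combined by simple concatenation before hashing, a coloured text is modelled as a list of (character, hex colour) pairs, 5 bits are stored per colour component (the capacity given for the #272727 threshold), and the 256-bit digest is repeated cyclically to fill the text. All function names are hypothetical.

```python
import hashlib
from itertools import cycle, islice

BITS_PER_COMPONENT = 5  # 0b11111 = 0x1f stays below the #27 component threshold


def watermark_bits(text, password):
    """SHA-256 of text + password (assumed combination) as a 256-character bit string."""
    digest = hashlib.sha256((text + password).encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)


def embed(text, password):
    """Pair every character with a near-black colour carrying 15 watermark bits."""
    stream = cycle(watermark_bits(text, password))  # repeat the watermark as needed
    marked = []
    for char in text:
        components = [
            int("".join(islice(stream, BITS_PER_COMPONENT)), 2) for _ in range(3)
        ]
        marked.append((char, "#{:02x}{:02x}{:02x}".format(*components)))
    return marked


def clean(marked):
    """Drop the colours, recovering the original (all-black) text."""
    return "".join(char for char, _ in marked)


def extract(marked):
    """Read the embedded bits back from the colour components."""
    bits = []
    for _, colour in marked:
        for start in (1, 3, 5):  # positions of rr, gg, bb in "#rrggbb"
            value = int(colour[start:start + 2], 16)
            bits.append(f"{value:0{BITS_PER_COMPONENT}b}")
    return "".join(bits)


def validate(marked, password):
    """Check whether the embedded bits match the watermark of the claimed password."""
    recovered = extract(marked)
    reference = cycle(watermark_bits(clean(marked), password))
    return recovered == "".join(islice(reference, len(recovered)))


marked = embed("Hello, world", "secret")
print(validate(marked, "secret"), validate(marked, "wrong"))
```

Because validation recomputes the hash from the cleaned text, the original document is not needed, which mirrors the blind extraction property described above.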


4 GRAYSCALE PERCEPTION SURVEY 


In this section, we briefly present the survey through which we 
identified the grayscale threshold to be used to ensure the visual 
indistinguishability of the watermarked text. 
// This function wipes out the watermark.
function CLEAN(wtmText)
    for each char ∈ wtmText do
        text ← text + colouring(char, #000000)
    return text

// The extracted watermark is compared with the new one.
function VALIDATION(wtmText, newPassword)
    orgText ← CLEAN(wtmText)
    newWtm ← SHA256(orgText, newPassword)
    for each char ∈ wtmText do
        colourString ← string(getColor(char))
        red, green, blue ← extractComponent(colourString)
        recoveredWtm ← recoveredWtm + red + green + blue
    return (newWtm == recoveredWtm) ? True : False


We devised a survey to study how people perceive different
shades of grey in text. In particular, we administered the survey
to 225 people (114 females and 111 males) covering various age
groups: 13.3% under 19 years old, 29.3% between 20 and 29, 7.6%
between 30 and 39, 12.0% between 40 and 49, 30.2% between 50 and
59, 6.7% between 60 and 69 and 0.9% over 70 years old.

The survey is designed in two parts¹. The first one consists of
12 questions with pairs of squares shown side by side, in which
one of the two squares, chosen at random, is always pure black
(#000000), while the other changes from question to question at
regular intervals to a lighter shade of grey, from #070707 to
#5f5f5f. We then asked the participants whether they thought the two
squares were both coloured black. The assumption is that if a user
recognises one shade of grey as different from black, they will
also identify all the lighter grey nuances. These preliminary
results using squares made it possible to reduce the threshold
search space. Figure 2 shows that more than 70% of people can
correctly identify the difference between #272727 and #000000. In
other words, the results suggest that we should aim for a lower
maximum hue to minimise the watermark's visibility.


[Figure 2 plot. Data labels recovered from the chart, in extraction order: 96.00%, 96.89%, 90.67%, 97.33%, 96.44%, 84.00%, 92.89%, 72.44%, 63.56%, 26.22%]


Figure 2: The percentage of users who detect the two squares 
as different as the shade of grey changes. 


¹ The survey dataset is available at this link: https://tinyurl.com/22s6nvem.
[Figure 3 image: excerpt of an Italian text in which some letters are recoloured in grey]

Figure 3: An example of text shown in the second part of the 
survey. The non-black letters have been underlined in red. 


The second part of the survey is designed to investigate the
visibility of different shades of grey applied to written text. In
particular, participants were asked to identify, in each of a
series of six texts, the first word in which they noticed a letter
of a different colour, namely a letter with a different shade of
grey. The texts are written in pure black, and some letters are then
recoloured in progressively lighter shades of grey. Moreover, the
texts vary in font size and boldness to mimic a good variety of
real-world scenarios. Figure 3 shows an example of text presented to
participants. Each word appears only once in the text, so that the
participant's choice can be identified unambiguously. To make it
easier for the reader to identify letters with a different shade of
grey, they have been underlined here in red.

Figure 4 shows the aggregated results of the second part. It is
worth noting that, across all font sizes, there is a sharp increase
in recognition beyond the shade of grey #272727, confirming that
perception is different when shades of grey are applied to text
and that higher threshold values are admissible. In particular,
using a threshold value of #070707, only 3.7% of the survey's
population correctly identifies grayscale variations, with the
percentage rising to 9.08% with the threshold value of #272727.
This is an acceptable range of detection, with thresholds that
still allow us to hide a large number of bits in a short text.


5 RESULTS AND DISCUSSION 


In this section, we discuss the crucial aspects of the proposed
method, such as the limits needed to preserve visual
indistinguishability, performance in terms of payload capability,
and the robustness of the watermark. In particular, we tested an
implementation of the proposed method by writing a Google Documents
add-on that automatically watermarks and verifies users' text
documents.

Figure 4: Identification success rate vs shade of grey for each
text, from smallest (Text 1) to largest font size (Text 5).


Visual Indistinguishability at the Edges - In some borderline
cases of the sequences of bits, the visual indistinguishability of
the watermark may be invalidated even with more restrictive
threshold values. In particular, some unfortunate combinations of
watermark bits can result in hexadecimal code values corresponding
to more visible shades of the primary colours (i.e., red, green,
blue) rather than shades of grey. For instance, the bit sequence
111110000000000 000001111100000 000000000011111 corresponds to three
hexadecimal colour values (#1f0000, #001f00 and #00001f), to be
assigned to three consecutive letters, that are completely
unbalanced among the colour components. The predominance of one
component in each chunk (e.g., the red component in the first of the
three chunks) potentially results in more noticeable text.
Establishing the distance between colours is challenging and beyond
the scope of this study, so we took a more conservative approach by
reducing the number of bits that can be accommodated in each
character. In particular, we imposed that the embeddable sequence of
bits allowed by the threshold is repeated for all components. This
ensures that every subsequence of bits of the watermark produces a
shade of grey.
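The effect of this conservative choice on capacity can be checked with a small Python sketch. The helper names are hypothetical; the only assumption is that the number of characters needed is the watermark length divided by the bits carried per character, rounded up. With the bit chunk repeated across the three components, each character carries only as many bits as a single component.

```python
import math


def bits_per_component(threshold: int) -> int:
    """Largest n such that every n-bit value respects the component threshold."""
    return (threshold + 1).bit_length() - 1


def chars_needed(watermark_bits: int, threshold: int, repeat_across_components: bool = True) -> int:
    """Characters required to embed a watermark of the given length."""
    per_component = bits_per_component(threshold)
    per_char = per_component if repeat_across_components else 3 * per_component
    return math.ceil(watermark_bits / per_char)


for threshold in (0x07, 0x27):
    counts = [chars_needed(n, threshold) for n in (64, 128, 256)]
    print(f"#{threshold:02x}{threshold:02x}{threshold:02x}", counts)

# Without the repetition, #272727 gives 15 bits per character.
print(chars_needed(256, 0x27, repeat_across_components=False))
```

Under these assumptions the sketch yields 22, 43 and 86 characters for #070707 and 13, 26 and 52 for #272727, which agree with the GTW rows of Table 1, while the unrestricted variant needs 18 characters for a SHA256 digest, as stated in Section 3.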

Embedding Performance - To compare the embedding performance
with state-of-the-art methods, we selected a structural text
watermarking method based on homoglyphs [15], which is a good
benchmark in terms of payload and visual indistinguishability. The
Homoglyph-based Watermarking (HBW) method embeds the watermark by
swapping some characters and whitespaces with characters that look
the same but have a different Unicode representation. As a dataset,
we used the New York Times Corpus², consisting of 1.8 million
articles from the New York Times newspaper spanning from 1987 to
2007. We focused on the lead paragraph to stress the fact that the
proposed method can be successfully applied to short portions of
text. GTW is evaluated using two different threshold values, namely
#070707 and #272727. Moreover, we used three different hash
functions (SipHash, MD5 and SHA256), which provide different levels
of security and generate bit sequences of increasing length. Table 1
shows how many characters are needed on average to embed watermark
bit sequences of varying length. The proposed GTW method with the
most restrictive threshold value, #070707, outperforms the HBW
method. In particular, GTW embeds the longest sequence of bits
(256 bits) using 15 fewer characters than HBW needs to embed the
shortest sequence of bits (64 bits). The GTW method with the
threshold value #272727 improves on the embedding capabilities of
the HBW method by 86.48% on average, showing the effectiveness of
our approach. It is worth noting that the GTW and HBW methods
are not mutually exclusive and can be combined to achieve a higher
embedding rate. As shown in Table 1, the combination requires just
46 characters to embed a 256-bit watermark, corresponding to
5 more characters than the table caption.
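The two comparisons above can be reproduced from the figures in Table 1. The sketch below assumes that the average improvement is the mean, over the three watermark lengths, of the relative reduction in required characters; the dictionary names are hypothetical.

```python
# Required characters per watermark length (in bits), copied from Table 1.
HBW = {64: 101, 128: 198, 256: 357}
GTW_07 = {64: 22, 128: 43, 256: 86}
GTW_27 = {64: 13, 128: 26, 256: 52}

# Relative reduction in characters of GTW (#272727) with respect to HBW.
reductions = [(HBW[n] - GTW_27[n]) / HBW[n] for n in HBW]
average_improvement = 100 * sum(reductions) / len(reductions)
print(f"{average_improvement:.2f}%")

# GTW (#070707) with the longest watermark versus HBW with the shortest one.
print(HBW[64] - GTW_07[256])
```

Under this reading, the average comes out at 86.48% and the difference at 15 characters, matching the values quoted in the text.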

Robustness - Here, we discuss the robustness of the proposed
method against the most common attacks. Its embedding capability
and the ability to embed the watermark repeatedly in the text make
the proposed method robust against insertion and deletion attacks.
For instance, as long as there are 52 unaltered consecutive
characters, there will be at least one copy of the 256-bit
watermark to retrieve in the text. The average word length in New
York Times articles is 4.9 characters; this means that


² https://catalog.ldc.upenn.edu/LDC2008T19
Table 1: Comparison results with state-of-the-art methods.


                            Required characters for:
Method (threshold)          64 bits   128 bits   256 bits
GTW (#070707)               22        43         86
GTW (#272727)               13        26         52
HBW [15]                    101       198        357
GTW (#272727) + HBW [15]    12        23         46
Khosravi et al. [9]         2,133     4,267      8,533
Por et al. [14]             199       399        798
Taleby A. et al. [19]       1,016     2,032      4,064


the most robust watermark can be hidden in 18 words using the
most stringent threshold, far fewer than the limit set by UK
government best practice³. In other words, insertion and deletion
attacks would require heavy changes to the watermarked text to
effectively remove the mark, leading to a completely different text
that no longer has any connection to the original. The partial
copy&paste attack is quite common and consists of an attacker
copying and pasting part of the watermarked text, violating the
author's copyright. Most structural watermarking techniques fail to
protect against this type of attack. In [14], just 0.01% of the
watermark is preserved against copy&paste. In contrast, the GTW
method performs excellently and, with 13 characters needed to embed
a 64-bit watermark, protects text at the sentence/word level, making
partial copy&paste attacks ineffective. The text replacing attack
can be considered a variant of the previous ones. In particular, the
proposed approach outperforms other structural methods, which
show a 2.6% success rate against this type of attack [9]. As with
any other structural or image-based method, GTW is vulnerable
to manual retyping and letter recolouring attacks, since producing
new text by retyping or recolouring characters completely wipes
out the watermark. However, a partial recolouring attack can be
considered equivalent to a replacement attack. It is worth noting
that the GTW method does not limit the user's ability to use
different colours in the text document, for instance writing
important words in red or blue. In that case, the solution is
similar to the current method and involves using shades of the
colour chosen by the user to embed the watermark or, more simply,
skipping those words. Another important aspect of text watermarking
is portability. In particular, we copied and pasted a watermarked
piece of text to and from some of the major text editing software:
Microsoft Word, LibreOffice Writer and Apache OpenOffice Writer.
The results showed that all copy and paste attempts preserved the
watermark, since all of these applications use the same hexadecimal
scale to represent colour and share a protocol that supports copy
and paste while preserving the text formatting.


6 CONCLUSIONS 


The provenance detection and intellectual property protection of
digital content have become a challenging research problem,
especially due to the increasingly widespread use of sharing
platforms. In this paper, we proposed a structural text watermarking
method that can embed up to 15 bits into every single character by
changing the font colour from pure black into a grayscale hue chosen
to be invisible to the human eye. We showed that as few as 13
characters are enough to embed a 64-bit watermark. The strengths of
the proposed method include the preservation of the length of the
original text, the guarantee that the change is visually
indistinguishable, and the fact that the syntactic and semantic
structure of the original text is unchanged. The proposed method,
as well as being independent of the language used in the text and
the user's preferred font, is fully portable to every major text
editing software currently available.


³ https://www.gov.uk/guidance/content-design/writing-for-gov-uk
ABSTRACT


Front-back vowel harmony is an important characteristic of many
languages. Testing whether an untranslated script has vowel harmony
may aid its decipherment. This paper tests vowel harmony for three
different modern languages (Hungarian, Spanish and Turkish) as well
as the extinct underlying language of the undeciphered Indus Valley
Script. We also introduce a novel vowel harmony index based on the
Exponential Random Graph Model for graphs. To achieve this, we first
select words from the Swadesh list of each of the modern languages.
Then we divide each word into syllables, isolating the vowels.
We then analyze the three modern languages using Exponential
Random Graph Model methods. The results indicate that this procedure
and the vowel harmony index are suitable for quantifying the degree
of vowel harmony in a language. The procedure is then extended
to the undeciphered Indus Valley Script (IVS). Our results indicate
that the underlying language of the IVS also had vowel harmony. We
found that on average the odds of the IVS having vowel harmony were
6.61 times higher than would be found in a random graph.
1 INTRODUCTION


In some languages, the vowels within a word tend to be paired with
each other if they are formed in the same area of the mouth. For
example, in English the vowels a, o and u are formed at the back
of the mouth, while the vowels e and i are formed at the front of
the mouth. Thus, banana has only back vowels, while cherry has only
front vowels. A strong tendency towards front-back vowel harmony is
common in Turkic and Uralic but is infrequent in Indo-European
languages [1]. Front-back vowel harmony is assumed to have been a
feature already in the Proto-Uralic language [4, 16] and can also be
detected in the extinct Minoan language [9].
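The front-back distinction used in the example above can be expressed as a small Python sketch. It is only an illustration of the classification of a single word, not the ERGM-based index proposed in this paper; the function name is hypothetical, and treating the letter y as a non-vowel is an assumption made to match the cherry example.

```python
BACK_VOWELS = set("aou")
FRONT_VOWELS = set("ei")


def vowel_class(word: str) -> str:
    """Classify a word as 'back', 'front', 'mixed' or 'none' by its vowel letters."""
    vowels = [c for c in word.lower() if c in BACK_VOWELS | FRONT_VOWELS]
    if not vowels:
        return "none"
    if all(v in BACK_VOWELS for v in vowels):
        return "back"
    if all(v in FRONT_VOWELS for v in vowels):
        return "front"
    return "mixed"


for word in ("banana", "cherry", "table"):
    print(word, vowel_class(word))
```

A word classified as back or front is harmonic in the sense described above, whereas a mixed word combines vowels from both areas of the mouth.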

In this paper, we propose a new exponential random graph model 
(ERGM)-based measure for the degree of vowel harmony in lan- 
guages and use that measure to compare three modern languages 
(Hungarian, Spanish, and Turkish) from three different language 
families (Uralic, Indo-European, and Turkic, respectively) and the 
underlying language of the Indus Valley Script, which is still con- 
sidered undeciphered. 

The rest of this paper is organized as follows. First, in the next 
section, we describe the data sources. Second, in the following 
section, we explain the use of simulated annealing for the Indus 
Valley Script. Third, we describe the exponential random graph 
model (ERGM) method [13] and propose the odds ratio coefficient
for ‘nodematch’ in an ERGM fit as a vowel harmony index. Fourth,
we present and discuss the experimental results. Since the presence
of front-back vowel harmony within the three modern languages
(Hungarian, Spanish, and Turkish) is already known, the
focus is on the question of where the underlying language
of the Indus Valley Script fits in. Fifth, in the last section, we give
some conclusions and directions for future work. 


2 DATA SOURCES 


For Hungarian, Spanish, and Turkish, we started with their Swadesh
lists, which contained the 207 most basic words in those languages.
Hungarian words were hyphenated by a native speaker, who also
identified and removed some suffixes. Spanish words were hyphenated
by a tool called ‘Hypenator.net’ available on the Internet, and
Turkish words were hyphenated using Wikipedia. The Spanish and
the Turkish Swadesh list words were assumed to be root words. For
the Indus Valley Script (IVS), we assumed that each sign represents
a syllable. From the Interactive Corpus of Indus Text (ICIT) database
of Indus Valley Script inscriptions of Well and Fuls [15], we selected
those putative multisyllabic root words that appeared in at least
four inscriptions. Table 1 shows the number of multisyllabic root
words used in the analysis.


Table 1: Statistics about the words considered and selected for the analysis.

Language              Words Considered           Multisyllabic Root Words
Hungarian             Swadesh list (207 words)   87
Indus Valley Script   ICIT inscriptions          61
Spanish               Swadesh list (207 words)   156
Turkish               Swadesh list (207 words)   82


3 SIMULATED ANNEALING 


For the Indus Valley Script, we used a simulated annealing process to
determine the most likely front/back label for each node. Each node
was first randomly assigned a front or back designation. We then
randomly selected 200 nodes and evaluated each selected node’s
neighbors. If the node had more front vowel neighbors, then its label
was changed to front. If the node had more back vowel neighbors, its
label was changed to back. Since there were only 61 distinct symbols,
some nodes were evaluated more than once. After the 200 evaluations,
a graph was drawn showing the probable front/back distribution. One
example graph can be found in the upper right part of Figure 1.
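
The relabeling loop described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the paper: the function name, the toy graph, the sign identifiers, and the rule of leaving a label unchanged on a tie are all hypothetical. As described, every update moves a node toward the majority label of its neighbors, so the sketch contains no temperature schedule.

```python
import random

def relabel_by_neighbors(adjacency, steps=200, seed=0):
    """Randomly label nodes, then align randomly chosen nodes with
    the majority front/back label of their neighbors."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    # Step 1: random initial front/back designation for every node.
    labels = {node: rng.choice(["front", "back"]) for node in nodes}
    # Step 2: evaluate `steps` randomly selected nodes (with repetition).
    for _ in range(steps):
        node = rng.choice(nodes)
        front = sum(1 for n in adjacency[node] if labels[n] == "front")
        back = len(adjacency[node]) - front
        if front > back:
            labels[node] = "front"
        elif back > front:
            labels[node] = "back"
        # Assumption: on a tie (or with no neighbors) the label is kept.
    return labels

# Hypothetical toy graph; "s1".."s5" are placeholders, not ICIT signs.
toy_adjacency = {
    "s1": {"s2", "s3"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2"},
    "s4": {"s5"},
    "s5": {"s4"},
}
toy_labels = relabel_by_neighbors(toy_adjacency, steps=200, seed=1)
```

Because the starting labels are random, different seeds can give different final labelings, which is consistent with the range of odds ratios reported for repeated runs in Section 5.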


4 EXPONENTIAL RANDOM GRAPH MODEL ANALYSIS


For each language, we create a graph where each node is a syllable,
and an edge connects two nodes if at least one root word contains
both of the corresponding syllables. The nodes of the graph are
labeled as belonging to front or back categories depending on
whether the vowel in the syllable is a front or a back vowel.
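
As a rough illustration of this graph construction, the Python sketch below builds the node labels and the edge set from hyphenated words. Several things here are assumptions rather than details from the paper: the function names, the input format (syllables separated by hyphens), the reading that every pair of syllables within a word is connected, and the simple vowel split (a, o, u as back; e, i as front) borrowed from the English example in the Introduction. The example words are illustrative only.

```python
from itertools import combinations

# Assumption: the simple front/back split from the Introduction example.
BACK_VOWELS = set("aou")
FRONT_VOWELS = set("ei")

def syllable_label(syllable):
    """Return 'front' or 'back' based on the first vowel in the syllable."""
    for ch in syllable.lower():
        if ch in FRONT_VOWELS:
            return "front"
        if ch in BACK_VOWELS:
            return "back"
    raise ValueError(f"no known vowel in syllable {syllable!r}")

def build_syllable_graph(hyphenated_words):
    """Nodes are syllables; an edge joins two distinct syllables
    that occur together in at least one word."""
    labels = {}
    edges = set()
    for word in hyphenated_words:
        syllables = word.split("-")
        for syllable in syllables:
            labels[syllable] = syllable_label(syllable)
        # Sorting makes each undirected edge appear once as (a, b), a < b.
        for a, b in combinations(sorted(set(syllables)), 2):
            edges.add((a, b))
    return labels, edges

# Illustrative input only, not taken from the Swadesh lists.
labels, edges = build_syllable_graph(["ba-na-na", "ma-no", "pe-ri"])
```

In this toy input every edge joins two syllables with the same label, which is the pattern a language with strong vowel harmony would tend to show.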

The graph is analyzed with an exponential random graph model
(ERGM) using the statnet and ergm packages in R, with model terms
for edges and ‘nodematch’ [13]. The fit yields a ‘nodematch’
coefficient that represents the degree to which the graph possesses
more edges between nodes that have matching labels than a random
graph would. We call the odds ratio form of this coefficient the
vowel harmony index.
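
For a model that contains only an edges term and a single ‘nodematch’ term, the edges are independent of each other, so the fitted ‘nodematch’ coefficient should coincide with the log odds ratio of a two-by-two table of node pairs (same label or different label, edge or no edge). The Python sketch below computes that quantity directly. It is a closed-form stand-in for intuition, not a call into the R packages named above, and the function name and toy data are hypothetical.

```python
import math
from itertools import combinations

def nodematch_log_odds(labels, edges):
    """Log odds ratio of an edge for same-label versus
    different-label node pairs."""
    edge_set = {frozenset(edge) for edge in edges}
    # counts[match] = [number of edges, number of non-edges]
    counts = {True: [0, 0], False: [0, 0]}
    for a, b in combinations(sorted(labels), 2):
        match = labels[a] == labels[b]
        has_edge = frozenset((a, b)) in edge_set
        counts[match][0 if has_edge else 1] += 1
    (same_e, same_n), (diff_e, diff_n) = counts[True], counts[False]
    if 0 in (same_e, same_n, diff_e, diff_n):
        raise ValueError("odds ratio undefined: a cell of the table is empty")
    return math.log((same_e / same_n) / (diff_e / diff_n))

# Hypothetical toy graph: three front nodes, three back nodes.
toy_labels = {"a": "front", "b": "front", "c": "front",
              "d": "back", "e": "back", "f": "back"}
toy_edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f"), ("c", "d")]

toy_log_odds = nodematch_log_odds(toy_labels, toy_edges)
toy_index = math.exp(toy_log_odds)  # odds ratio form of the coefficient
```

In the toy graph, four of the six same-label pairs and one of the nine different-label pairs are connected, so the odds ratio is (4/2)/(1/8) = 16.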


5 EXPERIMENTAL RESULTS AND DISCUSSION

Our experiment had two main steps. First, we created a dataset
for each language. Second, we created graphs for the words of
each language and found the vowel harmony index. The results are
summarized in Table 2.

Table 2: The log odds, the odds ratio, and the significance
value (P < .001 corresponds to a confidence level above 99.9
percent) according to the ERGM analysis.

Language              Log odds   Odds ratio   Significance
Hungarian             1.832      6.245        P < .001
Indus Valley Script   1.889      6.61         P < .001
Spanish               -0.1562    0.855        P = 0.231
Turkish               1.855      6.389        P < .001


For Hungarian, the graph has one main component of front 
vowels and many smaller components of either only front vowels 
or only back vowels (upper left of Figure 1). Hence this graph 
indicates a strong vowel harmony. 

For the underlying language of the Indus Valley Script, the graph 
has one large component and two small components with three 
and two nodes only. After simulated annealing tried to optimize 
the node labeling as back or front syllabic, the large component 
contained about the same number of front and back labeled nodes 
(upper right of Figure 1). Hence the visual analysis alone does not
show vowel harmony for the underlying language of the IVS.

For Spanish, the graph has one large cluster with many front and 
back vowels (lower left of Figure 1). While some of the lone pairs 
appear to be matched on front/back vowels, there is clear evidence 
suggesting that Spanish does not have vowel harmony. 

For Turkish, the graph has two main components (lower right of 
Figure 1). One component has only front vowels, while the other 
component has mostly back vowels. Hence this graph shows a 
strong degree of front-back vowel harmony. 

In addition to the visual analysis, an ERGM analysis calculated
the log odds and odds ratios for the four languages. The odds ratio is
the factor by which the odds of an edge between two nodes increase
when the nodes match in their front/back label. This coefficient is
our vowel harmony index. Its logarithm is called the log odds, which
is an alternative measure of vowel harmony.
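
The relation between the two measures is a plain exponential: the odds ratio equals e raised to the log odds, and the percent change in odds is the odds ratio minus one, times 100. The short Python sketch below applies this to the log odds values reported in Table 2. The function names are illustrative, and the printed values may differ from the table in the last digit because the table entries are rounded.

```python
import math

# Log odds values as reported in Table 2.
log_odds = {
    "Hungarian": 1.832,
    "Indus Valley Script": 1.889,
    "Spanish": -0.1562,
    "Turkish": 1.855,
}

def odds_ratio(theta):
    """Convert a log odds coefficient into an odds ratio."""
    return math.exp(theta)

def percent_change_in_odds(theta):
    """Percent change in the odds of an edge when labels match."""
    return (math.exp(theta) - 1.0) * 100.0

for language, theta in log_odds.items():
    print(f"{language}: odds ratio {odds_ratio(theta):.3f}, "
          f"change in odds {percent_change_in_odds(theta):+.1f}%")
```

A negative log odds, as for Spanish, gives an odds ratio below one and therefore a negative percent change, meaning slightly lower odds of an edge between nodes with matching labels.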

For Hungarian, nodes were 6.245 times more likely to have edges to
other nodes with the same front/back label, compared to a random
graph. Equivalently, since an odds ratio of 6.245 is 5.245 above
one, nodes had 524.5 percent higher odds of connecting to other
nodes with the same front/back label. This was a significant result.

For the underlying language of the IVS, after 2000 iterations of
simulated annealing performed repeatedly, the odds ratios for nodes
matching in front/back label ranged from a low of 2.22 to a high of
26.90, with an average of 6.61, compared to a random graph. These
results suggest that there is a high likelihood that the underlying
language of the IVS has vowel harmony.

Figure 1: ERGM analysis of Hungarian (upper left), the underlying language of the Indus Valley Script (upper right), Spanish
(lower left), and Turkish (lower right). Syllables with back vowels are shown in white and syllables with front vowels in orange.


For Spanish, the odds ratio of 0.855 means that nodes had about
14.5 percent lower odds of connecting to other nodes with the same
front/back label. This was not a significant result (P = 0.231),
suggesting that this relationship was about the same as would be
found in a random graph.

For Turkish, nodes were 6.389 times more likely to have edges to 
other nodes with the same front/back label compared to a random 
graph. This was also a significant result. 

Comparing the relative strength of vowel harmony in the four
languages, we see that the underlying language of the Indus Valley
Script has the highest vowel harmony index (6.61), Turkish is second
(6.389), and Hungarian is third (6.245). The difference in the vowel
harmony index is small among these three languages. This indicates
that the three languages all have strong front-back vowel harmony.
Spanish had the smallest vowel harmony index (0.855), which is
close to one, the value expected for random graphs. Hence
the vowel harmony index suggests that Spanish does not have
vowel harmony.

5.1 Comparison with the Minoan Language


The Minoan civilization was a contemporary of the Indus Valley civ- 
ilization in the Bronze Age. Both civilizations left some inscriptions 
that attracted the attention of many scholars, who tried to decipher 
them. Decipherment of a script is facilitated by knowing the major 
characteristics of the underlying language. The presence or absence 
of front-back vowel harmony is one such characteristic. If we can 
identify several common characteristics between the Indus Valley 
and the Minoan languages, then the probability is high that the two 
languages were relatives, that is, belonged to the same language 
family. 

The surprising result of this paper is that the underlying language 
of the Indus Valley Script seems to also have front-back vowel har- 
mony. Hence the closest relative of the still undeciphered language 
of the Indus Valley Script is likely found among those languages that 
have front-back vowel harmony. Vowel harmony of the underlying 
language of the Minoan scripts was studied in Revesz [9], which 
showed that the Phaistos Disk had front-back vowel harmony. The
Phaistos Disk is the longest Minoan inscription with a total of 241
signs that are stamped left-to-right in spiral form [11]. That the 
Indus Valley and the Minoan languages both had front-back vowel 
harmony suggests that the two languages may have been related. 
There are other interesting connections between the Indus Valley 
and the Minoan civilizations. Tsafou and Garcia-Granero [14] found 
that the Minoans used cumin (Cuminum cyminum) that originated 
from the Indus Valley civilization. Ialongo et al. [5] argued that 
Near Eastern traders were the middlemen in the trade between the
Indus Valley and the Minoan civilizations. However, Revesz, a data 
scientist, reanalyzed the weight unit data of Ialongo et al. [5] and 
showed that there was direct trade between the Indus Valley and the 
Minoan civilizations [12]. Revesz [12] also showed that there was a 
direct trade route between the Indus Valley and the Old European 
cultures, and the Minoan and Old European cultures too. This 
triangular connection raises the possibility that the Indus Valley, 
the Minoan, and the Old European cultures have some linguistic 
connections. Minoan and Hungarian belong to the same branch of 
the Uralic language family based on decipherment of some Minoan 
inscriptions [7] and regular sound changes [10], and their common 
ancestor may have been the language of the Old European culture. 
Similarities between Hungarian folk songs and Sanskrit literature 
also support the triangular connection among the three regions [8]. 
The similar motifs found in the Sanskrit literature may derive from 
the Indus Valley civilization [6]. 

Daggumati and Revesz [2] found the Indus Valley script to be closest 
to the Sumerian pictograms among a set of ancient scripts. However, 
an interesting similarity between the Indus Valley script and the 
Minoan Linear A script is that they both contain many allographs 
[3]. The direct trade between the two civilizations and this aspect 
of their scripts further increases the probability that these two 
civilizations spoke a similar language. 


6 CONCLUSIONS AND FUTURE WORK 


The ERGM odds ratio coefficient using ‘nodematch’ seems to be 
a suitable vowel harmony index. The vowel harmony index gives 
results that match expectations regarding several modern languages 
from three different language families. 

The closest relative of the still undeciphered language of the 
Indus Valley Script is likely found among those languages that 
have front-back vowel harmony, and it may turn out to be the 
Minoan language. This recognition could direct future work on
the decipherment of the Indus Valley Script together with other
significant analyses of the structure of the Indus Valley Script signs 
and inscriptions. 